Previous Row T-SQL (Part1/2)

Calculating running totals, difference from the previous row are common problems. The tricky part to resolve these problems is to be able to link to the previous row. This first part of a two part series in which i discuss options to solve this probelm. I start with running totals.

Test Data:

For this post i use the following script to create and populate sample data. Note that i use row constructors introduced in SQL Server 2008. Below script won’t work as is in version prior to 2008 unless inserted using multiple insert statements.

CREATE TABLE #YrSales
(
Yr INT
, SaleAmount INT
)

INSERT #YrSales
VALUES
(2007,15000)
,(
2008,10000)
,(
2009,25000)
,(
2010,20000)
,(
2011,18000)

How to calculate running totals:

Using Sub Query:

Following query uses a correlated sub query inside the select clause. Correlated sub query executes once for each row returned by the outer query. I originally learned this method from an article by Itzik Ben-Gan, who obviously doesn’t need any introduction.

SELECT
Yr, SaleAmount
,RunningTotal = (SELECT SUM(SaleAmount) FROM #YrSales WHERE Yr<= A.Yr)
FROM
#YrSales A

Suppose the first row returned is Yr = 2009; SaleAmount = 25000. Sub query performs a sum of SaleAmount for all the years less than or equals 2009 and returns 50000as RunningTotal. This continues until there are no more rows returned by the outer query.

Using Cross Join:

Using cross join is an efficient way especially when dealing with larger data sets. Cross joining the table with itself causes every row from the left table (a) to be joined to every row in the right table (b). Where clause ensures that join is performed only when Yr from the right table (b) is less than yr from the left table (a). Yr, SaleAmount from left table (a) in the select clause return Yr and SaleAmount. Sum(b.SaleAmount) returns running total.

SELECT a.Yr,
a.SaleAmount
,SUM(b.SaleAmount) AS RunningTotal
FROM #YrSales a
CROSS JOIN #YrSales b
WHERE (b.Yr <= a.Yr)
GROUP BY a.Yr,a.SaleAmount
ORDER BY a.Yr,a.SaleAmount

To better understand, execute the following queries one at a time:

SELECT
a.Yr,
a.SaleAmount
FROM
#YrSales a
CROSS JOIN
#YrSales b

SELECT
a.Yr,
a.SaleAmount
FROM
#YrSales a
CROSS JOIN
#YrSales b
GROUP BY
a.Yr,a.SaleAmount

SELECT
a.Yr,
a.SaleAmount
,SUM(b.SaleAmount) RunningTotal
FROM
#YrSales a
CROSS JOIN
#YrSales b
WHERE (b.Yr <= a.Yr)
GROUP BY
a.Yr,a.SaleAmount
ORDER BY
a.Yr,a.SaleAmount

Advertisements

SCRIPT: Basic Information About Indexes

I’m having a little fun with documenting basic information about indexes in my current project. I’m posting the scripts here mostly for me to come back when I need them in the future, but hopefully some of you might find them useful.

Find tables without any index:

This script gives you the list of tables that don’t have any index.

; WITH A
AS
(
SELECT SchemaName = OBJECT_SCHEMA_NAME(t.OBJECT_ID) ,
TableName = t.name
FROM   sys.tables t
WHERE  OBJECTPROPERTY(t.OBJECT_ID,'TableHasIndex') = 0
)
SELECT SchemaName ,
TableName ,
TwoPartName = SchemaName+'.'+TableName
FROM A

Find tables without a clustered index:

This script gives the list of tables without a clustered index.

; WITH B
AS
(
SELECT SchemaName = OBJECT_SCHEMA_NAME(OBJECT_ID) ,
TableName = OBJECT_NAME(OBJECT_ID)
FROM sys.indexes
WHERE index_id = 0
AND OBJECTPROPERTY(OBJECT_ID, 'IsUserTable') = 1
)
SELECT SchemaName ,
TableName ,
TwoPartName = SchemaName+'.'+TableName
FROM B
ORDER BY SchemaName

Find indexes on a table along with columns covered:

Use this script to get the index names and a comma separated list of columns included in the index.

; WITH C
AS
(
SELECT TableName = t.name
, IndexName = i.name
, ColumnName = c.name
FROM   sys.tables t
INNER JOIN
sys.indexes i
ON t.OBJECT_ID = i.OBJECT_ID
INNER JOIN
sys.index_columns ic
ON ic.OBJECT_ID = i.OBJECT_ID
AND ic.index_id = i.index_id
INNER JOIN
sys.columns c
ON c.OBJECT_ID = t.OBJECT_ID
AND c.column_id = ic.column_id
)
SELECT DISTINCT TableName, IndexName, ColumnNames
FROM cte A
CROSS APPLY
(
SELECT ColumnName + ', '
FROM cte B
WHERE A.TableName = B.TableName AND A.IndexName = B.IndexName
ORDER BY TableName, IndexName
FOR XML PATH('')
)
D (ColumnNames)

Notice that I generously used CTE’s here. I like to reuse column aliases instead of repeating length expressions everywhere. CTE allows you to do that.

@SamuelVanga

Be careful with two digit years

I recently had a scary conversation with a user, which can be broken down to:

User: Why would the data have customers with a birth date of MM/DD/2020?

Me: What are you talking about? Lemme take a look.

Me again, this time with a changed voice: Umm, Umm, i plead guilty with an explanation your honor!

User: Let’s hear your explanation son.

Me: We receive a raw file with customer data. Date of birth is a field in that file, and it is supposed to contain data in ‘DD-Month-YYYY’ format. But, apparently, one fine day the file contained two digits for year instead of four. My import process decided that the two digit year 20 is 2020 and not 1920.

User: Okay, I’ll let you go this time, but fix it. Won’t you?

<end of story>.

Turns out, SQL Server automatically uses current century for two digit years less than 50, and previous century for years greater than or equal to 50.

SELECT CAST('01-January-49' AS DATETIME) [<50]

Result:

<50
———————–
2049-01-01 00:00:00.000

(1 row(s) affected)

SELECT CAST('01-January-50' AS DATETIME) [>50]

Result:

<50
———————–
1950-01-01 00:00:00.000

(1 row(s) affected)

Here is the books online explanation for the two digit year cutoff. Benjamin Nevarez (t|b) pointed me to this link when i asked for help on twitter using the #sqlhelp hashtag. Thanks Benjamin.

Cheers!

@SamuelVanga.

You will also like If i learn something, sorry, but you have to read it!

LAST_VALUE() Analytic Function–SQL Server 2012

Earlier, i discussed FIRST_VALUE() function. FIRST_VALUE() returns the first value from an ordered set of records. Similarly LAST_VALUE() returns the last value for an ordered result set, but you are in for a bit of a surprise if you aren’t familiar with the enhancements to the SQL windows functions.

I’ll use the same example data i used earlier.

CREATE TABLE Customers
(CustId INT, OrderDate DATE, OrderAmount MONEY) ;

INSERT Customers VALUES
(1, ‘2011-12-03’, 12500)
, (
1, ‘2011-10-07’, 15000)
, (
1, ‘2011-09-12’, 16000)
, (
2, ‘2011-07-11’, 900)
, (
2, ‘2011-09-13’, 18000)
, (
2, ‘2011-11-03’, 7000)
, (
3, ‘2011-01-29’, 19000)
, (
3, ‘2011-04-19’, 1000)
, (
3, ‘2011-07-09’, 12000) ;

We wrote a query using FIRST_VALUE() to calculate the date of first purchase for each customer.

SELECT
c.CustId, c.OrderDate, c.OrderAmount
, FIRST_VALUE(OrderDate) OVER(PARTITION BY CustId ORDER BY OrderDate) CustomerSince
FROM
Customers c

Now, let’s say we want to know the recent purchase date for each customer. Similar to FIRST_VALUE(), there is a LAST_VALUE() function. LAST_VALUE() allows you to return the last value from an ordered result set.

I’ll follow my killer instincts and replace FIRST_VALUE() with LAST_VALUE() in the above query. It should work. Right?

SELECT
c.CustId, c.OrderDate, c.OrderAmount
, LAST_VALUE(OrderDate) OVER(PARTITION BY CustId ORDER BY OrderDate) CustomerSince
FROM Customers C

Result Set:

… but, oops, this is not the result i was expecting.

There are two points to note here. Window and Frame.

A window is a set of records that are defined by the OVER() clause. Because of the PARTITION BY Custid, you can safely think records for each Custid form a window in this example.

A Frame is a range of rows within a window; a subset of a window in other words, defined by the ROWS clause. A range has boundaries – a begin position and an end position.

Look back at the query using LAST_VALUE(). You’ll notice there wasn’t a ROWS clause. We haven’t defined a frame. In this case SQL Server uses the default frame which is, between the first row without any boundaries (first row of the window partition) and the current row. Last value from this default frame will always be the current row.

Correct usage of the function should instead be:

SELECT
c.CustId, c.OrderDate, c.OrderAmount
, LAST_VALUE(OrderDate) OVER(PARTITION BY CustId ORDER BY OrderDate ROWS BETWEEN  CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM Customers C

Given all this, FIRST_VALUE() is providing expected results even with the default frame because the first row in the default frame is the first row of the window partition.

Here is an interesting article from Jeremiah Peschka (blog) for further reading: Leaving the Windows Open.

~Sam

Follow Sam on Twitter – @SamuelVanga


FIRST_VALUE() Analytic Function–SQL Server 2012

Given a customer table:

CREATE TABLE Customers
(CustId INT, OrderDate DATE, OrderAmount MONEY) ;

INSERT Customers VALUES
(1, ‘2011-12-03’, 12500)
, (
1, ‘2011-10-07’, 15000)
, (
1, ‘2011-09-12’, 16000)
, (
2, ‘2011-07-11’, 900)
, (
2, ‘2011-09-13’, 18000)
, (
2, ‘2011-11-03’, 7000)
, (
3, ‘2011-01-29’, 19000)
, (
3, ‘2011-04-19’, 1000)
, (
3, ‘2011-07-09’, 12000) ;

Let’s say, we want to retrieve all records along with a column showing the customer’s date of first purchase.

We could start by computing the date of first purchase for each customer.

SELECT CustId, MIN(OrderDate) FirstPurchaseDate
FROM   Customers
GROUP BY
CustId

Result set has one row per each customer.

image

Joining this to the original table will return the expected results.

SELECT c.*, firstD.FirstPurchaseDate
FROM   Customers c
INNER JOIN
(
SELECT CustId, MIN(OrderDate) FirstPurchaseDate
FROM   Customers
GROUP BY
CustId
)firstD
ON firstD.CustId = c.CustId

Result set:

image

The query plan uses a nested join because of the join we were using the query above.

image

We can achieve the same results using the FIRST_VALUE() function.

SELECT
c.CustId, c.OrderDate, c.OrderAmount
, FIRST_VALUE(OrderDate) OVER(PARTITION BY CustId ORDER BY OrderDate) CustomerSince
FROM
Customers c

[PARTITON BY CustId], creates a partiton for each customer. [ORDER BY OrderDate] sorts each partition by the OrderDate. Now, from this ordered result set, FIRST_VALUE() returns the first value for each partition.

The query is easier to read and a join is not needed. Result: the nested loop is eliminated in the query plan. This could improve performance.

image

I’ll conclude with a question for you to understand LAST_VALUE(). From the same fictitious customer data, let’s say we want to compute the recent purchase date (as opposed to first purchase date from above example) of every customer. What would be the query? I welcome your response in the comments below.

@SamuelVanga

Column alias in order by and where

If you ever wonder why column alias can be used in order by and not in where clause, it’s the logical query processing order. Pinal Dave blogged about the complete order here.

SELECT
Name n, GroupName g
FROM  HumanResources.Department D
WHERE g = 'Executive General and Administration'

In the above query, from clause is executed first followed by where clause, and select goes last. So where clause is trying to use a column g which doesn’t exist until select completes execution and thus resulting in a “invalid column g” error.

SELECT
Name n, GroupName g
FROM  HumanResources.Department D
ORDER BY g

But order by in this above query is executed at the end after select has created a column g.

And order by lived happily ever after >:]

~Sam.