Generating Surrogate Keys With SSIS

A surrogate key is an auto generated value, usually integer, in the dimension table. It is made the primary key of the table and is used to join a dimension to a fact table. Among other benefits, surrogate keys allow you to maintain history in a dimension table. Despite of the their popularity, SSIS doesn’t have a built in solution for generating surrogate keys. Let’s take a look at a few alternatives in this post.

First, create the following table. We will import data from Person.Person table (AdventureWorks sample) into this table. Note PersonSK is the surrogate key.
CREATE TABLE [dbo].Person
(
[PersonSK] INT IDENTITY(1,1) NOT NULL
,
[FirstName] NVARCHAR(50) NULL
,
[LastName] NVARCHAR(50) NULL
,
CONSTRAINT PK_PersonSK PRIMARY KEY [PersonSK]
) ;

With IDENTITY()

Drag a data flow task on to the control flow and configure the OLE DB source.

Next, drag OLE DB destination and connect it to the source. Specify the connection manager, and choose Table or view – fast load as the data access mode. This performs a bulk insert and is the fastest of all.

Destination table has three columns, but source has only two columns. In the mappings page, map input and output columns for FirstName and LastName and ignore the mapping for PersonSK.

When you run the package, becasue PersonSK is an identity column in the table, SQL Server will automatically generate values for you. This solution is easy and fast, but sometimes depending on your ETL methodology, you can’t rely on IDENTITY().

With Script Component

I frequently use Script Transformation. The steps are nicely written in this post by Phil Brammer (b).  This is simply a script used as a transformation. The script generates a row number for each row passed through the data flow.

With ROW_NUMBER()

SSIS (data flow) engine generated the new row number when using script transformation. Instead, you can use ROW_NUMBER() when working with SQL Server data source to let the database engine do the work for you. This can be faster.

If you’re doing an incremental load, first find the maximum key value from the destination. I’ll use the following query. It’ll return zero if there were no rows, else it returns the maximum value.

SELECT  ISNULL(MAX(PersonSK),0) SK
FROM    dbo.Person

Add an Execute SQL Task to the control flow and set the result property to single row. Then, add a variable to hold the return value.

Next, connect a data flow task to the execute sql task. We will use the following SQL statement in the OLE DB source editor. In addition to the LastName and FirstName columns, we are using ROW_NUMBER() function to generate a unique number for every row.

SELECT
ROW_NUMBER() OVER(ORDER BY [LastName]) SK
, [LastName]
, [FirstName]
FROM
Person.Person

The query will generate numbers starting from 1 for each row, but while loading to destination we don’t want to start from 1. We want to find the maximum value in the destination and start from the next highest value. So, I’m adding the max value to every row number using parameters in the OLE DB source.


In the OLE DB destination, check the box that says Keep Identity. By do this we are asking SSIS to keep the identity values that are  generated by the source. In the mappings page, you’ll see a new input that was created in the OLE DB source. Map it to PersonSK surrogate key.

Go ahead and run the package. If it’s all good you will see cute little green tick marks like below.

Summary

In this example, we looked at different options to generate surrogate keys while loading dimensions using SSIS. We used IDENTITY() constraint in SQL Server. We talked about Script Component. Finally, we saw making use ROW_NUMBER() function. Last option is twice as fast as using the Script Component with around 20,000 rows and an index on LastName column.

Advertisements

Status Column in SSISDB Catalog Database

I occasionally review the search terms people use to get to this blog and secretly celebrate knowing my post helped them. If i know it didn’t, i add it to the list of items i should write about. This post is a result of one such thing.

Note: If you read my previous post about Logging in SSIS2012, please stop reading here. This doesn’t have any new information. I wrote this one first, but published the Logging post ahead of this and still publishing this to reach Googlers.

I assume you know about the SSISDB catalog database. Several tables/views have a column called [status] showing the status of an execution. It is an integer column with 1 through 9 as the possible values.

Description for these numbers isn’t available anywhere in the database but buried in books online like this.

Here is an example query you can use to translate the status numbers to descriptions.

USE SSISDB ;

SELECT
    
project_name  
     [status description]
    
= CASE [status]
              
WHEN 1 THEN 'created'
              
WHEN 2 THEN 'Running'
              
WHEN 3 THEN 'canceled'
              
WHEN 4 THEN 'failed'
              
WHEN 5 THEN 'pending'
              
WHEN 6 THEN 'ended unexpectedly'
              
WHEN 7 THEN 'succeeded'
              
WHEN 8 THEN 'stopping'
              
WHEN 9 THEN 'completed'
    
END
FROM  
[catalog] .[executions]
WHERE  execution_id = 30032 -- replace with your execution_id
;

 

SSISDB catalog execute status

Another example is with [Catalog].[Executable_Statistics]. Status column in this view shows the execution status of an executable. An executable is any package or a container or a task.

SELECT  e.package_name ,
e.[executable_name] ,
es.[execution_path] ,
es.[execution_duration] ,
[execution_result] = CASE es.[execution_result]
WHEN 0 THEN 'Success'
WHEN 1 THEN 'Failure'
WHEN 2 THEN 'Completion'
WHEN 3 THEN 'Cancelled'
END
FROM   
[catalog].[executables] e
INNER JOIN [catalog].[executable_statistics] es ON es.[executable_id] = e.[executable_id]
WHERE   e.execution_id = 40044

 

@SamuelVanga

T-SQL Tuesday 31: SSIS 2012 Logging

Topic for this month’s T-SQL Tuesday, hosted by Aaron Nelson (@SQLVariant) is Logging.

I like to think of logging as a Flight Data Recorder (FDR) to an aircraft. FDR records data about functioning of an aircraft. No one cares what is in there as long as the aircraft operates safely – but when bad things happen, every investigation boils down to finding and analyzing FDR data. Investigation results are used to understand what went wrong and make improvements to prevent bad things from reoccurring.

What are the similarities between FDR and Logging? No one will never spend time in reviewing logs as long as the software application is running smoothly, but when accidents happen (think performance degrades and errors) – details collected in the logs become vital for debugging the application. Logging process should collect any data that helps in troubleshooting.

Integration Services 2012 Logging

SQL Server Integration Services prior to 2012 had decent ability in implementing logging. Developers have gone above and beyond to extend the built-in functionality to come up with custom solutions, popularly known as custom logging or frameworks.

image

One of the bazillion improvements introduced in SQL Server Integration Services 2012 is Logging. Most of the custom logging we implemented earlier is now built into SSIS project deployment model. As a developer you don’t have to do any thing different.

Example

The below example project has two child packages – one for each table and a parent package executes the child packages.

image

image

Event Handlers are empty for all the three packages. In addition, there are no red arrows (failure precedence constraints, if that’s how you prefer) in any of the control flows. Showing this will take up quite a bit of space here so you’ll have to trust me on.

This means I’m not implementing any error handling or logging techniques in the packages.

Next, I executed the ETL Master package with logging level as Verbose. SSIS provides different logging levels. Matt Masson wrote an article explaining what events are included in different log levels. Verbose log level logs everything.

image

Let’s run a few queries against the SSISDB database to see what’s logged.

Execution

Every execution gets a unique execution_id. The following query retrieves the status and duration of the execution.

SELECT  [project_name] ,
[package_name] ,
[status description] = CASE [status]
WHEN 1 THEN 'created'
WHEN 2 THEN 'Running'
WHEN 3 THEN 'canceled'
WHEN 4 THEN 'failed'
WHEN 5 THEN 'pending'
WHEN 6 THEN 'ended unexpectedly'
WHEN 7 THEN 'succeeded'
WHEN 8 THEN 'stopping'
WHEN 9 THEN 'completed'
END ,
Duration = DATEDIFF(ms, start_time, end_time)
FROM    [catalog].[executions]
WHERE   [execution_id] = 40044

image

Executable and Executable Statistics

Below query gives you duration and status of all the executables.

SELECT  e.package_name ,
e.[executable_name] ,
es.[execution_path] ,
es.[execution_duration] ,
[execution_result] = CASE es.[execution_result]
WHEN 0 THEN 'Success'
WHEN 1 THEN 'Failure'
WHEN 2 THEN 'Completion'
WHEN 3 THEN 'Cancelled'
END
FROM   
[catalog].[executables] e
INNER JOIN [catalog].[executable_statistics] es ON es.[executable_id] = e.[executable_id]
WHERE   e.execution_id = 40044

image

Component Phases

This applies to data flow tasks. See which component phases take up most of the time.

WITH    cte
AS ( SELECT   [package_name] ,
[task_name] ,
[subcomponent_name] ,
[phase] ,
Duration = DATEDIFF(ms, start_time, [end_time])
FROM     [catalog].[execution_component_phases]
WHERE    [execution_id] = 40044
)
SELECT  [package_name] ,
[task_name] ,
[subcomponent_name] ,
[phase] ,
Duration = SUM(Duration)
FROM    cte
GROUP BY [package_name] ,
[task_name] ,
[subcomponent_name] ,
[phase]
ORDER BY [package_name] ,
Duration DESC

image

Execution Data Statistics

How many rows are transferred? How many buffers are used?

SELECT  [package_name] ,
[task_name] ,
[dataflow_path_id_string] ,
rows_sent = SUM([rows_sent]) ,
buffer_count = COUNT(*)
FROM    [catalog].[execution_data_statistics]
WHERE   [execution_id] = 40044
GROUP BY [package_name] ,
[task_name] ,
[dataflow_path_id_string]

image

And the list goes on…

I first read about SSIS Logging in Denali from an excellent article written by Jamie Thomson (b|twitter). Jamie also put together a bunch of reports called SSIS Reporting Pack on CodePlex. They are free and you must use them in conjunction with SSIS 2012.

I’m @SamuelVanga on Twitter.