How Do You Ensure Data Integrity in Cubes?

A friend of mine recently asked for assistance on creating a custom DLL. I’m a total n00b at this. But, i asked him, why he was pulling his hair over this.

He wants the DLL to implement data integrity in the cube. For example, in a customer dimension with SSN as an attribute, he was hoping, a DLL can be created to ensure that SSN is always 9 digit number.

I had no clue. However, I’ve suggested him to use constraints in the sql table, source data warehouse itself. Check constraint on SSN column can get the job done.

There are many benefits of using constraints in the data warehouse for this particular case:

  • Check constraints are simple. Even people like me, with no knowledge on DLL’s can maintain them.
  • They ensure that data in the warehouse is clean. You don’t want to allow 15 digit number for SSN, for example.
  • Only the data warehouse contains source data for cubes. And if data here is clean, there is no way bad data is entered into the cube. At least in the perfect world.
  • With clean data in the warehouse, reports can consume data from either warehouse or cubes.

What’s your approach?

What would be your approach to this problem? Did you create a DLL? What other solutions do you have?

You are welcome to post a comment below.

~Sam.

7 Free Tools Every Database and BI Developer Must Have

It was Thanksgiving in the US last week. As a way to say thanks, I wanted to use this space to give a shout out to free tools that help me in my day-to-day work.

Click on the following links to know more and download.

 

PASS SQL Saturday World Map

I had the pleasure of helping Karla Landrum (@Karlakay22) for the PASS Summit 2012 by creating two dashboards: SQL Saturday events and PASS Chapters on a world map. If you attended either SQL Saturday round table or Community Zone at the Summit, you may have seen them.

I used PowerView for Excel 2013 preview to create them and deployed to Office 365 SharePoint preview. Preview license will eventually expire, and those dashboards will be gone. So I thought I’d save them on this blog.

Since PowerView dashboards can’t be published for public access, I created similar dashboards using Tableau Public. If you’re interested in SQL Saturdays, I suggest you bookmark this page, because I plan on updating this view as new events are added in the future.

Clicking on the below image opens in a new window.

SQL Saturday World Map Dashboard

Tip: Use the + and – icons on the top left to zoom in and zoom out for a better experience!

PowerView Screen Prints

And, here are the screen prints people saw at the Summit, created using PowerView.

SQL Saturday Events on a World Map:

SQLSaturdayPowerViewworldmap_thumb5

Highlight Upcoming Events:

Upcoming_thumb4

Highlights Events for a Fiscal Year:

2011_thumb3

Drilldown by Country:

US_thumb4

User Groups by Country:

UG_thumb3

You too can create these with PowerView. Here is how:

Download Excel 2013 preview from here, and the workbook I used from here. Dan English (@denglishbi) wrote an article titled PowerView meet Excel 2013 part1 and part2. That helps you get started. Dan also wrote a book on PowerView you may to check out. I haven’t read the book yet, but I heard good things.

Conclusion

I hope you enjoyed this post. If you saw these maps at the Summit, please let me know what you thought either by leaving a comment below or by sending a tweet to @SamuelVanga. I’d greatly appreciate that.

I enjoyed every bit of volunteering on this project for Karla and PASS. I appreciate the opportunity. Karla, Thank you! And thanks to Niko Neugebauer for his constant feedback to make the dashboards look better.

I’m glad I was able to help!

PASS Summit 2012 Twitter Dashboard

As a business intelligence developer, my skills include taking the raw data, shaking it to remove all the junk, and presenting it to business users in the most sexiest way possible – smart people often use the words ETL, and BI to define this process, but I’m not smart – I neither have glasses nor a French beard. ^_^

Since I missed the Pass Summit 2012, I closely watched twitter feed for #sqlpass. Oh boy! That was amazing! There are believers (#PASSPrayers), first timers (#sqlfirsttimers), karaoke’rs (#sqlkaraoke), repeating offenders (Alumni), runners (#sqlrun) to name a few, making This a perfect family (#sqlfamily).

That was me watching #sqlpass tweets btw
OMG! So close.

Then, I put my skills to work. Through out the summit I collected tweets that were coming in with #sqlpass and other hashtags mentioned above, imported them to a database, and created the following dashboard.

See PASS Summit 2012 as it happened on twitter. Click the image below to interact (opens in a new window).

PASS Summit 2012 Twitter Dashboard - sqlpass

Charts on the top show the top tweeters and the busiest hours. Times are shown in Pacific Standard Time. Clearly, people tweeted more during the keynote sessions and there are cool people in the top tweeters list.

When you select a tweeter, all views in the dashboard are filtered for that particular user. For example, in the below image I select BrentO (Brent Ozar). You can see everything Brent tweeted and tweets over time.

image

Similarly, you can select a time to see everything for that hour.

Last functionality I show is searching tweets. Enter a term in the search tweets filter area in the bottom left. For example, I enter ‘Boring’ to get this:

image

Good news: there are only a few tweets that reflect a boring event. Bad news: there are tweets suggesting that the event (at some point) is boring. You can see when attendees felt boring. Don’t get me wrong – It’s not my intention to convey that Summit was boring. A good BI solution tells business users when and where there are problems.

Database Backup

Download the database backup from here. It’s a SQL Server 2012 RTM database. There are three tables:

dbo.Hashtag contains all the hashtags that were collected. HashtagID is an identity column.

image

dbo.Tweep is a table to store tweeters. UserID is the unique Twitter ID for a user, it’s not Identity in the database.

image

dbo.Tweet is used to store tweets. TweetID in this table is again an unique ID generated by Twitter. Note that TweetID alone isn’t unique in this table. Primary key is TweetID + HashtagID.

image

This query removes duplicates and gives you unique tweets. Change the filter conditions as you like.

USE HashtagArchive ;

WITH cte
AS
(
SELECT tp.UserHandle
      
, tw.CreateDate
      
, tw.Tweet
      
, tw.Latitude
      
, tw.Longitude
      
, PST = DATEADD(HH,-8,CreateDate)
       ,
RowNum = Row_Number() OVER( Partition BY tw.TweetID ORDER BY (SELECT NULL))
FROM   dbo.Tweet tw
      
INNER JOIN dbo.Tweep tp
      
ON tw.UserID = tp.UserID
      
INNER JOIN dbo.Hashtag ht
      
ON ht.HashtagID = tw.HashtagID
WHERE  ht.Hashtag IN ('#Summit12', '#SqlPass','#Sqlrun','#sqlkaraoke','#sqlfamily','#ssisprecon','#PASSPrayers')
       AND
CreateDate > '2012-11-07 08:00:00.000'
)
SELECT UserHandle
      
, CreateDate
      
, PST
      
, [Date] = CAST(PST AS DATE)
       ,
[Time] = CONVERT(VARCHAR(8),PST,108)
       ,
Hr = DATEPART(HH,PST)
       ,
Tweet
      
, Latitude
      
, Longitude
FROM cte WHERE RowNum = 1
ORDER BY CreateDate ;

 

There is some noise with the #FirstTimers hashtag. I didn’t think young US citizens who voted for the first time would be on Twitter, strange! Make sure to avoid that and you’ve an archive of Twitter database for PASS Summit 2012.

Follow me and leave a comment

Too much to ask, I know! If you enjoyed this post, follow me on twitter (@SamuelVanga) – I’d love to get to know you. And leave a comment with what you discover from this dashboard.

Enjoy!

Related Posts

SQL Saturday US Map – Posted 08/03/2012

Update: 11/12/2012 08:00 PM

Chris Webb (@Technitrain on Twitter), MVP and BI Expert posted  an example using NodeXL to analyze tweets. Check it out – It’s amazing!

Create Several Shell SSIS Packages In Minutes Using BIML

What do i mean by shell packages? You’re spot on – that’s an excellent question to begin with. For the purpose of this post, I define shell package as an empty package that is renamed appropriately and contains all the required connection managers.

It is a common pattern among SSIS developers to create separate packages for each unit of work. For example, you will create a package for each table that needs to be loaded.

The only problem: manually creating hundreds of packages, sometimes even more, is time consuming and boring. If you think about it, first steps of creating all these packages is always the same: Right click on SSIS Packages, select New SSIS Package, rename the package, and create Connection Managers.

Wouldn’t it be nice if there is a way to automate these boring steps? I hear you screaming yes.

Package templates help you to some extent. You create a package with the required connection managers and save it as a template. The subsequent packages will automatically contain all connection managers the template has, you don’t have to add them manually to all packages. You still have to add and rename packages though.

Let’s take a look at a much better and time saving method!

I use BIML to dynamically create multiple SSIS packages for each table that needs to be loaded. All packages will be renamed consistently and will contain connection managers.

Demo:

Execute the following SQL to create source and destination databases.

CREATE DATABASE SrcDatabase ;

CREATE DATABASE DestDatabase ;

Next, create some tables in the destination database.

USE DestDatabase;
GO

CREATE TABLE dbo.DimProduct (
ProductKey INT IDENTITY(1, 1) PRIMARY KEY
,ProductCode VARCHAR(10) UNIQUE
) ;

CREATE TABLE dbo.DimCustomer (
CustomerKey INT IDENTITY(1, 1) PRIMARY KEY
,CustomerCode VARCHAR(10) UNIQUE
) ;

CREATE TABLE dbo.DimEmployee (
EmployeeKey INT IDENTITY(1, 1) PRIMARY KEY
,EmployeeCode VARCHAR(10) UNIQUE
) ;

Again, our objective is to create one package for each of the tables in the destination database.

Open BIDS, I’m using SSIS 2008R2 for this example. You’ll have to install BIDS Helper from here. Right click on the project and add New BIML File. Your solution explorer look like this:

image

Double click the BIML file and paste the following script. If you see issues while copy pasting, check this link.

<#@ template language=”C#” hostspecific=”true”#>
<#@ import namespace=”System.Data” #>

<Biml xmlns=”http://schemas.varigence.com/biml.xsd”>
<Connections>
<Connection Name=”SrcDatabase” ConnectionString=”Data Source=(local);Initial Catalog=SrcDatabase;Provider=SQLNCLI10.1;Integrated Security=SSPI” RetainSameConnection =”true” DelayValidation =”false”/>
<Connection Name=”DestDatabase” ConnectionString=”Data Source=(local);Initial Catalog=DestDatabase;Provider=SQLNCLI10.1;Integrated Security=SSPI;” RetainSameConnection =”true” DelayValidation =”false”/>
</Connections>
<Packages>

<# string DestCnStr =”Data Source=(local);Initial Catalog=DestDatabase;Provider=SQLNCLI10.1;Integrated Security=SSPI”;
DataTable tables = ExternalDataAccess.GetDataTable(DestCnStr,”SELECT t.name FROM sys.tables t”) ;
foreach (DataRow row in tables.Rows)
{ #>
<Package Name=”Load_<#=row[0]#>” ConstraintMode=”Linear” >
<Tasks>
<Container Name=”Dummy — remove container” ConstraintMode=”Linear”>
<Tasks>
<ExecuteSQL Name=”Dummy — remove task” ConnectionName=”SrcDatabase”>
<DirectInput>select 1</DirectInput>
</ExecuteSQL>

<ExecuteSQL Name=”Dummy 1 — remove task” ConnectionName=”DestDatabase”>
<DirectInput>select 1</DirectInput>
</ExecuteSQL>
</Tasks>
</Container>
</Tasks>

</Package>
<# } #>
</Packages>
</Biml>

After copying the code, right click on the BIML file and select Generate SSIS Packages.

image

Bingo!

Multiple packages with connections managers have been created in a matter of minutes.

image

You just have to delete the sequence container and proceed with package development. Why?  We created connection managers in the BIML script, but since we are not using them anywhere in the package, they won’t be included in the packages. So, I tricked it by creating a dummy task. This is still easier compared to the traditional process of creating multiple packages.

image

Don’t get me wrong. This example is probably the most basic of what BIML has to offer. I’m spending some time lately with BIML and I hope to do more posts as I learn new things.

Reference: http://bimlscript.com/Browse/Snippets

@SamuelVanga

Generate Uniqueidentifier with SSIS

If you are trying to generate Uniqueidentifier/Newid() in SSIS data flow, you will soon realize that there is no out-of-the-box transformation you could use. Fear not. Script Component can be used to create Uniqueidentifier columns.

In the below package, I’ve an OLE DB source. Then, i dragged script component to the data flow. I selected transformation as the script component type and connected green arrow from source to script transformation.

Next, open the script transformation editor. In the Inputs and Outputs page, expand Output 0, select output columns and click add column in the bottom, give it a name, select unique identifier [DT_GUID] as the data type, and click ok. I named mine UniqueId.

SSIS NEWID() Script Add Column data type

In the script page, make sure Microsoft Visual Basic 2008 is selected as the scripting language. I just like working with VB. Click edit script, then copy paste the following script.

' Microsoft SQL Server Integration Services Script Component
' Write scripts using Microsoft Visual Basic 2008.
' ScriptMain is the entry point class of the script.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<microsoft.sqlserver.dts.pipeline.ssisscriptcomponententrypointattribute()> _
<clscompliant(False)> _
Public Class ScriptMain
Inherits UserComponent

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Row.UniqueId = System.Guid.NewGuid()
End Sub
End Class

 

When you add a destination, UniqueId will be available as a new column in the mappings page.

SSIS UniqueIdentifier Mappins page New Column

@SamuelVanga

SQL Saturday US Map

Apparently, SQL Saturday isn’t popular in the west. See for yourself. I wanted to see what cities frequently host SQL Saturday and ended up creating this dashboard. There are some interesting findings. Look at how the events are primarily confined to only (right) half of United States. There are several states that haven’t yet seen a SQL Saturday!

image

I created this using Tableau Public. Unfortunately, functionality of the dashboard on this blog is limited, so I encourage you to click on this link to see an interactive map.

Which state has conducted the most events so far? That’s right, it’s the Sunshine state. Good guys in Florida are outstanding by conducting 20 events already.

I hope this post is fun. Your comments are welcome. Cheers!

I’m @SamuelVanga on Twitter.