How Do You Ensure Data Integrity in Cubes?

A friend of mine recently asked for assistance on creating a custom DLL. I’m a total n00b at this. But, i asked him, why he was pulling his hair over this.

He wants the DLL to implement data integrity in the cube. For example, in a customer dimension with SSN as an attribute, he was hoping, a DLL can be created to ensure that SSN is always 9 digit number.

I had no clue. However, I’ve suggested him to use constraints in the sql table, source data warehouse itself. Check constraint on SSN column can get the job done.

There are many benefits of using constraints in the data warehouse for this particular case:

  • Check constraints are simple. Even people like me, with no knowledge on DLL’s can maintain them.
  • They ensure that data in the warehouse is clean. You don’t want to allow 15 digit number for SSN, for example.
  • Only the data warehouse contains source data for cubes. And if data here is clean, there is no way bad data is entered into the cube. At least in the perfect world.
  • With clean data in the warehouse, reports can consume data from either warehouse or cubes.

What’s your approach?

What would be your approach to this problem? Did you create a DLL? What other solutions do you have?

You are welcome to post a comment below.

~Sam.

Advertisements

PASS Summit 2012 Twitter Dashboard

As a business intelligence developer, my skills include taking the raw data, shaking it to remove all the junk, and presenting it to business users in the most sexiest way possible – smart people often use the words ETL, and BI to define this process, but I’m not smart – I neither have glasses nor a French beard. ^_^

Since I missed the Pass Summit 2012, I closely watched twitter feed for #sqlpass. Oh boy! That was amazing! There are believers (#PASSPrayers), first timers (#sqlfirsttimers), karaoke’rs (#sqlkaraoke), repeating offenders (Alumni), runners (#sqlrun) to name a few, making This a perfect family (#sqlfamily).

That was me watching #sqlpass tweets btw
OMG! So close.

Then, I put my skills to work. Through out the summit I collected tweets that were coming in with #sqlpass and other hashtags mentioned above, imported them to a database, and created the following dashboard.

See PASS Summit 2012 as it happened on twitter. Click the image below to interact (opens in a new window).

PASS Summit 2012 Twitter Dashboard - sqlpass

Charts on the top show the top tweeters and the busiest hours. Times are shown in Pacific Standard Time. Clearly, people tweeted more during the keynote sessions and there are cool people in the top tweeters list.

When you select a tweeter, all views in the dashboard are filtered for that particular user. For example, in the below image I select BrentO (Brent Ozar). You can see everything Brent tweeted and tweets over time.

image

Similarly, you can select a time to see everything for that hour.

Last functionality I show is searching tweets. Enter a term in the search tweets filter area in the bottom left. For example, I enter ‘Boring’ to get this:

image

Good news: there are only a few tweets that reflect a boring event. Bad news: there are tweets suggesting that the event (at some point) is boring. You can see when attendees felt boring. Don’t get me wrong – It’s not my intention to convey that Summit was boring. A good BI solution tells business users when and where there are problems.

Database Backup

Download the database backup from here. It’s a SQL Server 2012 RTM database. There are three tables:

dbo.Hashtag contains all the hashtags that were collected. HashtagID is an identity column.

image

dbo.Tweep is a table to store tweeters. UserID is the unique Twitter ID for a user, it’s not Identity in the database.

image

dbo.Tweet is used to store tweets. TweetID in this table is again an unique ID generated by Twitter. Note that TweetID alone isn’t unique in this table. Primary key is TweetID + HashtagID.

image

This query removes duplicates and gives you unique tweets. Change the filter conditions as you like.

USE HashtagArchive ;

WITH cte
AS
(
SELECT tp.UserHandle
      
, tw.CreateDate
      
, tw.Tweet
      
, tw.Latitude
      
, tw.Longitude
      
, PST = DATEADD(HH,-8,CreateDate)
       ,
RowNum = Row_Number() OVER( Partition BY tw.TweetID ORDER BY (SELECT NULL))
FROM   dbo.Tweet tw
      
INNER JOIN dbo.Tweep tp
      
ON tw.UserID = tp.UserID
      
INNER JOIN dbo.Hashtag ht
      
ON ht.HashtagID = tw.HashtagID
WHERE  ht.Hashtag IN ('#Summit12', '#SqlPass','#Sqlrun','#sqlkaraoke','#sqlfamily','#ssisprecon','#PASSPrayers')
       AND
CreateDate > '2012-11-07 08:00:00.000'
)
SELECT UserHandle
      
, CreateDate
      
, PST
      
, [Date] = CAST(PST AS DATE)
       ,
[Time] = CONVERT(VARCHAR(8),PST,108)
       ,
Hr = DATEPART(HH,PST)
       ,
Tweet
      
, Latitude
      
, Longitude
FROM cte WHERE RowNum = 1
ORDER BY CreateDate ;

 

There is some noise with the #FirstTimers hashtag. I didn’t think young US citizens who voted for the first time would be on Twitter, strange! Make sure to avoid that and you’ve an archive of Twitter database for PASS Summit 2012.

Follow me and leave a comment

Too much to ask, I know! If you enjoyed this post, follow me on twitter (@SamuelVanga) – I’d love to get to know you. And leave a comment with what you discover from this dashboard.

Enjoy!

Related Posts

SQL Saturday US Map – Posted 08/03/2012

Update: 11/12/2012 08:00 PM

Chris Webb (@Technitrain on Twitter), MVP and BI Expert posted  an example using NodeXL to analyze tweets. Check it out – It’s amazing!

BI Pyramid (read categories of users in a BI solution)

Purpose of any business intelligence solution is to provide users information and enable them to make intelligent decisions that benefit their business. But one size solution doesn’t fit all.

Depending on their job duties and where they fall in the management chain, users can be categorized into at least three groups. A successful BI solution will come in different flavors to fulfill the requirements of these different users.

A pyramid resembles the users in an organization.

Operations: Draw two imaginary lines horizontally across the pyramid dividing it into three sections. The bottom section is the widest, this can be attributed to a group of users that are responsible for day-to-day operations, running business as usual (BAU). These users will be interested in transactions and detail data. Report requirements from these users are predictable, predefined and standard. Predefined or canned reports from components like SSRS, Excel are created for this purpose.

Analysts: The next level up is a group that can be called data analysts or middle management. Identifying trends, detail business analysis, strategic decision-making is carried out by this group. They usually start at a place, drill down into more details, slice and dice the data across different areas. Report requirements from this group are unpredictable. Ad-hoc reporting capabilities are to be provided to them by using tools like report builder, excel pivot tables, power pivot for excel.

Executives: Last but not the least top section of the pyramid represents the key group of users that make executive decisions. Users like sales managers, execs. They are highly interested in summary information like yearly sales amount, total items sold etc., In addition these users can’t spend a lot of time to go through all the reports and get what they want. Dashboards will provide executives the ability to get results fastly.

In summary

Canned Reports for Operations
Ad hoc reports for Middle Management
Dashboards for Executives

~Sam