How Do You Ensure Data Integrity in Cubes?

A friend of mine recently asked for assistance with creating a custom DLL. I’m a total n00b at this. But I asked him why he was pulling his hair out over it.

He wants the DLL to implement data integrity in the cube. For example, in a customer dimension with SSN as an attribute, he was hoping a DLL could be created to ensure that SSN is always a 9-digit number.

I had no clue. However, I suggested he use constraints in the SQL tables of the source data warehouse itself. A check constraint on the SSN column can get the job done, as the sketch below shows.
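
Here’s a minimal sketch of what I had in mind, assuming a hypothetical DimCustomer table (the table, column, and constraint names are made up for illustration):

-- Hypothetical dimension table in the warehouse; names are for illustration only.
ALTER TABLE dbo.DimCustomer
    ADD CONSTRAINT CK_DimCustomer_SSN
    -- Exactly nine characters, all of them digits.
    CHECK (LEN(SSN) = 9 AND SSN NOT LIKE '%[^0-9]%');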

There are many benefits of using constraints in the data warehouse for this particular case:

  • Check constraints are simple. Even people like me, with no knowledge of DLLs, can maintain them.
  • They ensure that data in the warehouse is clean. You don’t want to allow a 15-digit number for an SSN, for example.
  • The data warehouse is the only source of data for the cubes. If the data there is clean, there is no way bad data enters the cube. At least in a perfect world.
  • With clean data in the warehouse, reports can consume data from either the warehouse or the cubes.

What’s your approach?

What would your approach to this problem be? Have you created a DLL for something like this? What other solutions do you have?

You are welcome to post a comment below.

~Sam.


Workspace database server was not found

You might constantly see a warning message like the one below when creating Analysis Services Tabular projects. It basically says the workspace database server ‘ServerName’ was not found.

[Screenshot: workspace database server warning]

You’ll have to change this setting in the model properties. The workspace server should be an Analysis Services instance running in Tabular mode.

[Screenshot: model properties, Workspace Server setting]

Otherwise, you’ll have to deal with this every time. It’s such a pain, right? Fixing it for good is easy: click Tools, go to Options, and expand Analysis Services. Change the default workspace server and the default deployment server to an Analysis Services instance that’s running in Tabular mode.

[Screenshots: Tools > Options > Analysis Services – default workspace server and default deployment server settings]

~Sam.

Why Isn’t There A Deployment Server Edition For SSIS Projects?

Andy Leonard blogged yesterday about a gotcha when creating SSIS projects using SSDT. He showed that you can use any component in SSDT, whether or not your SQL Server edition supports that component. For example, you can create a package with Fuzzy Lookup and deploy it to a server running Business Intelligence edition (which doesn’t support Fuzzy Lookup), and you won’t even be warned until the package bombs when executed from the command line or SQL Agent.

Rafi Asraf made a comment on that post about how SSAS projects handle this well. I’ll try to elaborate on that comment here.

There is a property called Deployment Server Edition for SSAS projects.

SSAS Multidimensional:

[Screenshot: Deployment Server Edition property in a Multidimensional project]

SSAS Tabular:

[Screenshot: Deployment Server Edition property in a Tabular project]

I selected Standard edition. Perspectives aren’t supported in this edition, so when I try to create one, SSDT shows a warning message. This is friendly.

[Screenshot: SSDT warning that perspectives aren’t supported in Standard edition]

Now, why isn’t a similar property available for SSIS projects? That, my friend, is beyond the scope of my brain. Adding one would definitely save a lot of time and reduce confusion for developers.

~Sam.

Follow Sam on Twitter – @SamuelVanga

Distinct Count in Analysis Services

Business users are often interested in finding distinct counts in addition to other aggregations. They may want to know the number of distinct customers calling customer support, the number of distinct employees selling business, or the number of distinct products being sold. Note the difference between distinct count and the other aggregations usually modeled in OLAP cubes, such as sum of sales or count of orders.
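
In plain SQL terms, the difference looks like this (the SupportCalls table here is hypothetical, purely to illustrate):

-- Hypothetical table of support calls; names are made up for illustration.
-- COUNT(*) counts every call; COUNT(DISTINCT ...) counts each customer once.
SELECT COUNT(*)                   AS TotalCalls,
       COUNT(DISTINCT CustomerID) AS DistinctCustomers
FROM dbo.SupportCalls;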

Requirement:

I’ll use an online forum website as an example. Assume *SiteX* has several users, and these users are allowed to post on the forums daily. Given this, let’s suppose we want to find the count of distinct users posting on the forums on any given day.

Star Schema:

I have created a star schema with two dimension tables: DimUser to store all the users, and DimDate to store all the dates. The fact table FactForum records user activity on the forums.

[Diagram: star schema with DimUser, DimDate, and FactForum]
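
If you’d like to follow along, a minimal sketch of these tables could look like this (I’m keeping only the columns used in this post; any other attributes are up to you):

-- Minimal star schema sketch; only the columns used in this example.
CREATE TABLE dbo.DimDate (DateKey INT PRIMARY KEY);  -- e.g. 20120124

CREATE TABLE dbo.DimUser (UserKey INT PRIMARY KEY);

CREATE TABLE dbo.FactForum
(
    DateKey    INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    UserKey    INT NOT NULL REFERENCES dbo.DimUser (UserKey),
    TotalPosts INT NOT NULL
);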

DimDate and DimUser have five rows each. The following shows the fact data – all the posts were posted by three distinct users.

DateKey     UserKey     TotalPosts
----------- ----------- -----------
20120124    1           15
20111217    1           9
20110921    1           16
20120115    3           5
20111119    3           9
20111119    5           17
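
Before building the cube, it helps to know the answer we expect. Against the rows above, a query like this returns the distinct posters per day – two on 20111119, one on every other day, and three distinct users overall:

-- Distinct users and total posts per day, straight from the fact table.
SELECT DateKey,
       COUNT(DISTINCT UserKey) AS DistinctUsers,
       SUM(TotalPosts)         AS TotalPosts
FROM dbo.FactForum
GROUP BY DateKey;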

Cube:

Next, I created a cube with the dimensions and measure group from the star schema.

[Screenshot: cube structure with the FactForum measure group and dimensions]

The following is a snapshot from the cube browser, showing the total posts sliced by user and date.

[Screenshot: cube browser – total posts by user and date]

It’s been a walk in the park thus far, ain’t it?

Distinct Count:

Next, I created a new measure from the measures pane, configured its usage to Count of rows and its source table to DimUser. This creates a measure group from the dimension table; in other words, DimUser now acts as both a dimension and a measure group.

[Screenshots: new measure dialog, the resulting DimUser measure group, and the dimension usage tab]

The dimension usage shown above indicates that the DimUser measure group is related to the DimUser dimension. The relationship is a fact relationship.

Lastly, I added a many-to-many relationship between DimUser and DimDate. Many-to-many requires an intermediate measure group, which in this case is FactForum.

[Screenshot: many-to-many relationship with FactForum as the intermediate measure group]

The following image from the cube browser shows total posts and the number of distinct users posting, sliced by calendar date.

[Screenshot: cube browser – total posts and distinct users by date]

Conclusion:

This is one way of modeling distinct count in Analysis Services. I find it useful for several reasons, one of which is performance: the built-in distinct count measure forces a separate measure group and can be expensive to process and query on large fact tables (see the white paper below).

@SamuelVanga

Further Reading:

SSAS Distinct Count Optimization White Paper.

SSAS Processing Mode (Lazy Aggregations)

The processing mode property of a partition (or measure group) determines when partitions become available to users.

Processing mode has two possible options.

  • Regular. This is the default setting. When set to Regular, partitions become available to users only after the data has been loaded and the aggregations have been created completely.
  • Lazy Aggregations. When set to Lazy Aggregations, partitions become available to user queries as soon as the data has been loaded. Aggregations are created by a separate background process while users begin to query the partition.

[Screenshot: partition properties showing the ProcessingMode setting]

For those of you who like technical lingo: Process Full internally runs Process Data and Process Index before the partition can be queried. If processing mode is set to Lazy Aggregations, the partition is released for user queries as soon as Process Data completes, and Process Index is executed in the background.

Faster availability of the partition comes at a price: since aggregations don’t exist yet when users begin to query the partition, they may experience slow query performance.

~Sam.