Konstantin Gorakine's BI Lingual Blog



Subquery with Lookup transformation in SSIS


Sometimes it is necessary to run an additional subquery somewhere in the middle of a Data Flow Task. Suppose we have some complex SQL select logic at the Data Flow Source, where we initially get hold of our data, then a few transformations down the way (maybe a conditional split or two, a merge or a union), and suddenly we realize that we need to support the next transformation's logic with the results of a subquery that has to run exactly now, at this point, and not at the source or anywhere before. I looked into posts explaining how to implement a subquery, but they all utilized a Script Component or a stored procedure's OUTPUT parameters. Wouldn't it be nice to implement this quite common requirement on the fly, without complex coding or writing stored procedures?

It can be done with the Lookup transformation. Let's assume that the incoming data has a Department ID column with unique values per row, and we want to return the row count for each Department ID in the flow. Open the Lookup Transformation Editor and set the No Cache option, since we don't need to cache the reference data set. Then, on the Connection tab, submit the following query.
 

select DepartmentID, count(*) DepartmentIDCount from OurTable
where DepartmentID = '000000' -- dummy department
group by DepartmentID

This query returns nothing, since the WHERE clause submits a dummy Department ID. The query is required only for the Lookup's internal metadata generation, which allows us to connect the input and reference data sets on the Columns tab.

The real query is provided on the Advanced tab and is parameterized by Department ID:

 

select DepartmentID, count(*) DepartmentIDCount from OurTable
where DepartmentID = ?
group by DepartmentID

 

At run time, a valid Department ID is provided and the query returns the row count for each Department.

This approach is not optimized, since no caching is involved, but it can be suitable for a limited number of rows, and in situations when the Data Flow is complex or a quick subquery is required.
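The per-row behavior of the No Cache Lookup can be sketched outside of SSIS. Here is a minimal Python/sqlite3 stand-in, assuming the OurTable example above; the data values and the in-memory database are purely illustrative:

```python
# Simulate a No Cache Lookup: one parameterized count query per incoming row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OurTable (DepartmentID TEXT)")
conn.executemany("INSERT INTO OurTable VALUES (?)",
                 [("D1",), ("D1",), ("D2",)])

incoming_rows = [{"DepartmentID": "D1"}, {"DepartmentID": "D2"}]
for row in incoming_rows:
    # The "real" query from the Advanced tab, executed once per input row
    cur = conn.execute(
        "SELECT DepartmentID, COUNT(*) AS DepartmentIDCount "
        "FROM OurTable WHERE DepartmentID = ? GROUP BY DepartmentID",
        (row["DepartmentID"],))
    row["DepartmentIDCount"] = cur.fetchone()[1]

print(incoming_rows)
```

The one-query-per-row loop is exactly why the No Cache option is slow on large inputs.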


Written by Konstantin Gorakine

March 4, 2010 at 4:18 pm

Posted in SSIS

Fact table load failure: business key logging


When we load data into a data mart, the usual sequence of events is:

 

  1. Update dimension tables with new or modified records. If a new record is added, a new surrogate key is generated in the dimension table. (Business keys are stored in the dimension table as well, to historically group surrogate keys (SCD Type 2) and to provide surrogate key lookup functionality during the fact table load process.)
  2. Load the fact table: first, select business keys along with measures from the data source(s); then replace the business keys with surrogate keys by looking them up from the dimension tables; and finally, load the surrogate keys along with the measure data into the fact table. 
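The second step can be sketched as follows, with a plain dict standing in for a dimension table's business-key-to-surrogate-key mapping; all key and measure values here are made up for illustration:

```python
# Step 2 of the load: replace business keys with surrogate keys, keep measures.
dim_product = {"P-001": 1, "P-002": 2}   # business key -> surrogate key

fact_source = [
    {"ProductBusinessKey": "P-001", "Sum": 10},
    {"ProductBusinessKey": "P-002", "Sum": 20},
]

fact_table = []
for row in fact_source:
    fact_table.append({
        "ProductID": dim_product[row["ProductBusinessKey"]],  # surrogate key lookup
        "Sum": row["Sum"],                                    # measure loaded intact
    })

print(fact_table)
```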

Sometimes, while a data warehouse fact table is loading, the SSIS Lookup transform (if used) fails to match a business key and doesn't return a surrogate key from the looked-up dimension table. This may happen in the case of early-arriving facts (inferred records) or errors during the dimension load process. Naturally, users want to track those faulty business keys and often request a logging framework that will persist the relevant record metadata for further analysis. Knowing that many fact tables with different granularities and numbers of keys exist even in the simplest data warehouse environment, this solution requires some careful thought. The following strategic questions should be answered as early in the development stage as possible:

 

  • Should we create multiple error log tables (one per fact table) or just one that accommodates all existing fact tables?
  • What information should be stored in the error log table(s) to provide the most usable content?
  • How do we populate the error log table(s)? 

Here, I want to share my solution for business key logging, which provides all the necessary metadata for identifying the data problem, has low maintenance overhead, and follows a straightforward development approach.

 

Let's say we have a small data mart with only three dimensions and one fact table. The table structure is very simple:

 

CREATE TABLE [dbo].[_DimProduct](
      [ProductID] [int] NULL,
      [ProductBusinessKey] [varchar](10) NULL,
      [ProductName] [varchar](50) NULL
)

CREATE TABLE [dbo].[_DimLocation](
      [LocationID] [int] NULL,
      [LocationBusinessKey] [varchar](10) NULL,
      [Location] [varchar](50) NULL
)

CREATE TABLE [dbo].[_DimDate](
      [DateID] [int] NULL,
      [DateBusinessKey] [varchar](10) NULL,
      [Date] [varchar](20) NULL
)

CREATE TABLE [dbo].[_FactProductLocationDate](
      [ProductID] [int] NULL,
      [DateID] [int] NULL,
      [LocationID] [int] NULL,
      [Sum] [int] NULL
)

 

Additionally, we need a Fact Source table (in real life, this would be a number of source OLTP tables with many joins between them). This table will play the role of the real data source and will consist of business keys and measures only. Our task is to bring this data into our data mart by performing a surrogate key lookup on each business key column and then loading the surrogate keys and intact measures into our fact table.

 

When the SSIS Lookup transform is configured to redirect the current row on lookup error, it returns NULL for the requested fields (in our case, the surrogate key only) when no business key match is found, and then redirects the entire row to the Lookup Error Output path.
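That redirect behavior can be sketched like this, with a dict standing in for the dimension table; the names and rows are illustrative, not SSIS API:

```python
# Simulate a Lookup with "redirect row" on error: unmatched rows go to
# the error output with their business key intact; matched rows continue
# to the fact load with the surrogate key substituted.
dim_location = {"L-001": 1}  # business key -> surrogate key

matched, error_output = [], []
for row in [{"LocationBusinessKey": "L-001", "Sum": 5},
            {"LocationBusinessKey": "L-999", "Sum": 7}]:
    surrogate = dim_location.get(row["LocationBusinessKey"])  # None ~ NULL
    if surrogate is None:
        error_output.append(row)  # redirected to the Error Output path
    else:
        matched.append({"LocationID": surrogate, "Sum": row["Sum"]})

print(matched, error_output)
```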

 

CREATE TABLE [dbo].[_FactSource](
      [ProductBusinessKey] [varchar](10) NULL,
      [DateBusinessKey] [varchar](10) NULL,
      [LocationBusinessKey] [varchar](10) NULL,
      [Sum] [int] NULL
)

 

Now it's time to design our Error Log table(s). We need to decide whether there will be one or many error log tables. My thought was that creating one error log table for each fact table would introduce schema navigation overhead, so my decision is to stay with only one error log table per data mart. Since there will be only one table, it needs to be flexible enough to accommodate any error data potentially coming from different fact tables.

 

Let's talk about the set of fields this table should have. We want to know exactly which business key failed and its value, the name of the Lookup transform that redirected the erroneous row, the name of the package where it occurred, and the time of the failure. The following columns will be added to the _ErrorLog table:

 

  1. ErrorLogID – identity column, generated by the _ErrorLog table
  2. FailedBusinessKey – this column should be able to store any value of the faulty business key; business keys can be of different types (usually String or Numeric), so I made it a String for flexibility
  3. FailedBusinessKeyColumn – stores the name of the business key column ("LocationBusinessKey", "ProductBusinessKey", etc.)
  4. FailedLookupName – stores the name of the SSIS Lookup transform
  5. FailedDimensionTable – stores the name of the dimension table that failed to return a surrogate key
  6. PackageName, PackageRunDate and UserName – additional metadata about the SSIS package that encountered the lookup problem 

Here is the final structure of the _ErrorLog table:

 

CREATE TABLE [dbo].[_ErrorLog](
      [ErrorLogID] [int] IDENTITY(1,1) NOT NULL,
      [FailedBusinessKey] [varchar](10) NULL,
      [FailedBusinessKeyColumn] [varchar](50) NULL,
      [FailedLookupName] [varchar](50) NULL,
      [FailedDimensionTable] [varchar](50) NULL,
      [PackageName] [varchar](50) NULL,
      [PackageRunDate] [datetime] NULL,
      [UserName] [varchar](50) NULL
)

 

This is how the SSIS package looks after Lookup Error Logging functionality has been added to it:

 

 

All Lookups are configured to redirect bad records to the Error Output, each followed by a unique Derived Column transform. Let's look at one of them and see what fields it adds to our Error Output data flow.

 

Inside the "Product Unique columns" Derived Column transform we have a set of custom fields, each related to the corresponding _ErrorLog table columns:

 

  1. FailedBusinessKeyColumn – hardcoded value; provides the name of the source column that contained the failed business key. In our example it is "ProductBusinessKey"
  2. FailedLookupName – hardcoded value; provides the name of the failed Lookup transform
  3. FailedBusinessKey – provides the value of the failed business key. This field is a String type, since business keys can be of different types and String is the most flexible one
  4. FailedDimensionTable – hardcoded value; provides the name of the dimension table that failed to provide a valid surrogate key 

As you can see, 3 out of 4 fields are hardcoded, but they give us very detailed information about the error and are definitely worth the time spent creating them. Of course, all hardcoded values need to be changed accordingly for each unique Derived Column transform, to provide specific information about each Lookup.
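The per-Lookup Derived Column step amounts to attaching three hardcoded fields plus the failed key value itself. A sketch in Python, where the function name and the "Lookup Product" transform name are assumptions for illustration:

```python
# Stand-in for the "Product Unique columns" Derived Column transform:
# three hardcoded metadata fields plus the failed key, cast to a string.
def product_unique_columns(error_row):
    return {
        "FailedBusinessKey": str(error_row["ProductBusinessKey"]),  # cast for flexibility
        "FailedBusinessKeyColumn": "ProductBusinessKey",  # hardcoded
        "FailedLookupName": "Lookup Product",             # hardcoded (assumed name)
        "FailedDimensionTable": "_DimProduct",            # hardcoded
    }

print(product_unique_columns({"ProductBusinessKey": "P-404", "Sum": 3}))
```

Each Lookup's error path would get its own copy of this step with the hardcoded values adjusted.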

 

Since the Derived Column task produces new data fields of the Unicode type, I used type-cast expressions to convert Unicode to String. This saves some time, package space and buffer size by not using an additional Data Conversion transform.

 

Since all the derived columns generated by the unique Derived Column transforms have the same names, we can easily map all the outputs by using a Union All transform.

 

Right after the Union All transform, I used an additional Derived Columns transform, which adds more fields to our Error Output data flow; these map to the rest of the _ErrorLog table columns:

 

  1. PackageName – name of the package that encountered the lookup problem
  2. PackageRunDate – date and time of the package run when the lookup problem occurred
  3. UserName – account user name that ran the SSIS package 

All that information is provided by the SSIS package system variables: [System::PackageName], [System::StartTime] and [System::UserName], respectively.
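The post-Union All step can be sketched as one function that stamps every error row with the package-level metadata. SSIS system variables have no Python equivalent, so the environment lookups and the default package name below are stand-in assumptions:

```python
# Stand-in for the shared Derived Columns transform after Union All:
# attach package-level metadata to every redirected error row.
import os
from datetime import datetime

def add_package_metadata(error_row, package_name="LoadFactProductLocationDate"):
    error_row = dict(error_row)
    error_row["PackageName"] = package_name       # ~ [System::PackageName]
    error_row["PackageRunDate"] = datetime.now()  # ~ [System::StartTime]
    # ~ [System::UserName]; environment lookup is an assumption
    error_row["UserName"] = os.getenv("USER") or os.getenv("USERNAME") or "unknown"
    return error_row

row = add_package_metadata({"FailedBusinessKey": "P-404"})
print(row)
```

Because this step sits after the Union All, it is written once rather than once per Lookup.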

 

 

Now, if we run our package loading both good and erroneous data, we can see that some rows were redirected to the Error Output path and stored in the _ErrorLog table.

 

 

Now we know exactly where the problem occurred: the package, lookup transform, table, field and even its value (and the user name too) that prevented a successful fact data load.

 

Conclusion

It is almost always necessary to provide data logging capabilities during the data warehouse load process. Data analysts, DBAs and business users need functionality to trace back erroneous data and perform data validation tasks as soon as possible. The more detailed information they have, the less time it will take them to address the problem at hand. In this article, I provided a simple yet flexible step-by-step solution that allows us to store metadata related to the problematic data records that a fact table load process may encounter.

 

Written by Konstantin Gorakine

November 10, 2009 at 2:21 pm

Posted in SSIS