Partilhar via


Transforming data in a data warehouse through SQL views

This post is an add on to the another post titled Designing an ETL process with SSIS: two approaches to extracting and transforming data where Phill Devey responded with the following question:

With regard to your statement " With staging tables, transformations are implemented as database views" Are you suggesting that your Dimension and Fact "tables" will be views or do you mean the query that populates the Dimension and Fact tables will be based on a view?

That’s a good question, and let me explain what I’ve found most practical.

Initially, when I create a new dimension or fact table, I simply create it as a view. Doing so allows me to quickly develop, test and debug the transformation in SSMS. If performance is not an issue, I might even deploy it this way into production.

In practice, once the transformation seems to work well, what I usually do is rename the view and create a stored procedure which creates a table using the original view. Because the new table has the same name as the original view, nothing breaks. Here’s what the stored procedure might look like:

    1: CREATE PROCEDURE [dbo].[sp_AccountFact]
    2: AS
    3: BEGIN
    4:     SET NOCOUNT ON;
    5:  
    6:  
    7:     SELECT * 
    8:     INTO #MyTempTable
    9:     FROM dbo.[Account Fact View]
   10:     
   11:     BEGIN TRANSACTION MY_TRANS
   12:  
   13:     IF  EXISTS (
   14:             SELECT [name]
   15:             FROM [sys].[tables]
   16:             WHERE name='Account Fact'
   17:     )
   18:     BEGIN
   19:         DROP TABLE tabular.[Account Fact]
   20:     END;
   21:  
   22:     SELECT * 
   23:     INTO tabular.[Account Fact]
   24:     FROM #MyTempTable
   25:  
   26:     COMMIT TRANSACTION MY_TRANS 
   27:  
   28:     CREATE UNIQUE NONCLUSTERED INDEX [Account key] ON [tabular].[Account Fact]
   29:     (
   30:         [Account key] ASC
   31:     )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
   32:  
   33:     CREATE COLUMNSTORE INDEX [NonClusteredColumnStoreIndex-20130801-145157]
   34:     ON [tabular].[Account Fact](
   35:         -- Add columns here
   36:     );
   37:  
   38: END

In this example, the stored procedure [dbo].[sp_AccountFact] does the heavy lifting of recreating the [tabular].[Account Fact] table. You would call this stored procedure during the transform phase of your ETL process.

The view with the transformation logic (which was previously called [tabular].[Account Fact] ) is now called [dbo].[Account Fact View] . The content of this view is first copied into a temporary table using the SELECT INTO statement. On line 11 we start a transaction so that the entire update is atomic. In lines 13-20 the existing table is deleted if it exists, and then in line 22 the table is recreated from the temporary table we created before, again using the SELECT INTO statement. Then the transaction is committed. Finally we recreate the indexes. Note that the second index in this example is a column store index which greatly improves the performance of most ad hoc queries.

The transaction ensures that clients performing queries on the data warehouse get valid results while the ETL is running. However, those clients might get blocked until the transaction completes. Therefore it’s important to minimize the duration of the transaction. This is why we first copy the view into a temporary table before the transaction starts. We also recreate the indexes outside the transaction for this reason.

Another design decision is whether to drop the table and recreate it – as I do here – or to just truncate the table. If we would just truncate the table, the stored procedure would become even simpler because we don’t need to check if the table already exists before dropping/truncating it, and we don't need to recreate the indexes. On the other hand, a benefit of using the drop-and-recreate method is that we only need to maintain the table’s schema in one place, i.e. in the view. In the truncate scenario we would need to update the schema of the table every time we modified the view. A second benefit is that it simplifies the usage of column store indexes. In SQL Server 2012, you cannot modify the contents of a table which has a column store index. (I believe this constraint no longer exists in SQL Server 2014, but I have not yet verified that myself).

Note: another good practice is to use schemas. In this data warehouse, the dimensional tables are in a schema called “tabular”. I also typically create a separate schema for each source system and its staging tables. The dbo schema has “internal” objects which are important for the ETL but are not visible to normal users. Normal users of the data warehouse only need read access to the tabular schema to perform the OLAP queries of their desire.

Comments

  • Anonymous
    April 21, 2015
    The comment has been removed

  • Anonymous
    April 22, 2015
    Phill - if you're used to designing transactional databases, it requires quite a mental switch from thinking in normalized data models to dimensional modelling. However, it's well worth it. Ralph Kimball's book "The Data Warehouse Toolkit" does a phenomenal job in explaining it. I highly recommend it.Here's a super short recap: for everything that you want to be able to aggregate (sum, count, etc) you need a Fact table. This fact table will have one or more columns with numeric fields (sometimes each row just has '1' for one of those fields in case you need to be able to count). Additionally the fact table has key columns which allow it to be joined to dimensions. Fact tables don't need their own primary key. (You never directly join Fact tables with other Fact tables.)Dimension tables are tables which contains things you want to slide and dice by.When you do a query on the DW, you almost always start with a Fact table in the FROM part of the SELECT clause, and then join it with the dimension tables.Often, you'll create both a Fact and a Dimension table for a given source table. For example, if you have a customer table in the source database, you would have often find a Customer Fact table with customer count, customer revenue, etc, and a Customer Dimension table with all the attributes you might want to report by.With regard to SQL Server password management - I usually use Windows integrated security. The database security must be configured such that you grant access to your own account (for dev) and to a service account (for production). On the production server, the SQL Server Agent Service is configured to run the ETL job. The SQL Server Agent service is reconfigured to run using the service account, and the job step that runs the SSIS package is configured to run as the SQL Server Agent Service Account (which now is the service account).  This way you don't need to store any passwords in the SSIS package. You do need watch out in case there's a group policy which requires the service account's password to be updated periodically. When you update the service account's password, you shouldn't forget to re-enter the password on the SQL Server Agent process as well.Hope that makes sense.

  • Anonymous
    April 22, 2015
    I'll look into the changes regarding the SQL service account.  Thanks for the help on that.  You also just answered something I've been thinking about regarding the same tables fitting into both Fact and Dimension categories.  That will help me as I begin to build the reports.I spent all morning building my first Dimension table which contains our inventory items that will link to the order Fact table in the OLAP cube.  Per your post, I built the select statement as a view because it contains 1 main table and 7 joined tables for various lookup values such as product group, unit of measure, material type etc..  Now I'm building the SSIS package that will output the view and insert into the Dimension Table.If I may, I'd just like to ask 1 more question about this.  In my Item Dimension table, is it common practice to output the ID fields of the lookup values as well as the description field for the lookup  E.g. Each item has a Material ID that Links to a materials table.  Should I include the Material ID and the Material Description in the Dimension table?  I'm trying to think ahead as it may be used to filter the result and be a way to group and drill down into the data.  Which of course, leads to another question!  Does the materials table need to be it's own dimension table to allow that type of functionality?  If so, that's going to result in 2 Dimension tables being joined together in the OLAP cube like so.Fact Orders --> Dim Items --> Dim Materials  is that even possible or allowed?  If so, I would also have to do the same thing on product group as I already know they want to analyze the data that way.Sorry for asking so many questions.  

  • Anonymous
    April 22, 2015
    Phill - there are two common approaches to dimensional modelling, known as "star diagram" and "snowflake diagram". Using the snow flake paradigm, dimension tables could be joined to other dimension tables. The way to think of that is that you're able to drill down 2 levels, first from the fact table to the first a dimension table which is directly joined to it, then from there two another dimension table.I prefer using the other approach - the star diagram - where you don't need multiple levels of dimensions.  In your example this would mean that you would already include the material ID on the fact table so you can directly join the material dimension to it. You can model it this way, even if in the source database those tables are not directly related. The SQL view which defines the fact table would already include the logic to determine the material ID for each item.In your situation, without knowing too much about your data - I would advise the following:Start by building an Item Fact table. I would include a column called "Item Count" column and other numeric values such as weight and price which are available or which you could pre-calculate in the SQL query driving the Item Fact table.Next I think you need a Order dimension, which you can join with the Item Fact table. Hence the Item Fact table should include the Order ID. I wonder if you need a Order Fact table, because you might be able to satisfy all your query requirements through the Item Fact table. You could include customer attributes, such as customer name, on the Order dimension. Generally you don't want to have more dimensions than necessary. At a later time you can always implement Customer as a separate dimension. In your case, I would think that you'd also have a Material dimension table. The Material Dimension table would include things such as material name and class. Remember that numeric properties such as weight can already be implemented on the Item Fact table.  Hence the Item Fact table would also include the foreign key to the Material dimension table. In your comment you mentioned an OLAP cube. Have you considered implementing this as a tabular cube? Tabular cubes are designed using a relational data model, so in that case you wouldn't need to design a dimensional data model.

  • Anonymous
    April 23, 2015
    The comment has been removed

  • Anonymous
    April 23, 2015
    The comment has been removed

  • Anonymous
    April 23, 2015
    Thanks again for the input.  I shall look at scaling back my Fact table as suggested however, the problem with removing the order header from the Fact table, I will lose many of the foreign keys that allow me to link to other dimension tables.  I would have to change create a Dim OrderHeader table that looks at 7 other tables such as customer, sales location, sales type etc. Then the customer table links to 7 other tables just for customer type/group information.  I can create it through the ETL process using a view (the subject of the original post), but I'm just not familiar with working with data in that way.But if that's how it should be done, then that's what I need to do.In total I'm working with 24 source tables just for Sales Analysis so you can understand how complex this is.  I'm about to add more tables to based on customer Hierarchy and Product type.Yes, this is a huge project for a beginner to BI and this is just the first of many reports that will address different areas of the business.  Unfortunately, that's how things operate where I work, I have to learn on the job and try and do the best I can without the experience.  I would hate to have tacked this without any SQL knowledge at all, at least I have my 15 years programming experience to fall on.As I'm still building my Fact and Dimension tables, I haven't been able to play around with building a cube in SSAS yet.  I did read a great article on it so I do have a general idea how it should work now.  Thanks for posting the information about it and Excel tabular cubes, that could help me a lot.

  • Anonymous
    April 28, 2015
    @Phill - I am open to the idea of writing a couple of blog posts about your project, and how I would advise to address various tasks and challenges, including dimensional modelling. This way I can put a lot more thought on solving your specific challenges, and others who visit my blog can also learn from it. It would require you to share some more specifics, such as the data schema of your operational systems. If you're interested in this, please drop me an email by clicking the "Email Blog Author" button on this page so we can discuss it further by email.