Wednesday, December 21, 2011

TFS 2010 Can't Convert Branch to Folder

Branches can be removed from a TFS project by converting them back to a folder. This is usually done when a branch was created by accident and needs to be removed. The tricky part is when the Convert to Folder menu option doesn't appear. If you find yourself in this situation you're most likely using the context menu (right-clicking). Try navigating to File -> Source Control -> Branching and Merging. You should see Convert to Folder there.

I'm not sure if it's true, but a colleague of mine told me that Microsoft did this on purpose to help avoid accidental conversion of branches to folders by making the command harder to get to. It also makes things very confusing and frustrating - IMHO.

Thursday, December 15, 2011

SQL Server 2008 R2 Master Data Services: A Review

I recently had the opportunity to spend a couple of days working through a proof-of-concept using Microsoft's SQL Server 2008 R2 Master Data Services (MDS) application. I had been curious about the product ever since I had heard about it, as it is intended to solve a problem I have often encountered during the beginning stages of numerous data warehousing projects.

The problem I am referring to is finding a source for core organizational data such as products, customers, store locations, raw material types, and business units - the data that ultimately makes up the dimensional model's dimension tables. Oftentimes this data is spread across numerous systems or, worse yet, lives in one or more spreadsheets that are maintained by numerous people, are missing data, and contain duplicates. In addition, the data that is there is often out of date or simply invalid.

A data warehouse is just about worthless if the dimensions by which the measures will be evaluated are of low quality. It could even be detrimental if business decisions are being made based on invalid data.

Before moving on to a technical solution to these data issues, it is very important that everyone involved in the success (or failure) of an MDM and/or data warehousing solution understand that technology will not make the problem of unreliable, disparate data go away. Technology acts as an enabling platform for well thought out, planned, and accepted data governance and stewardship. A brief definition of the two can be found here.

MDS assumes you have these things figured out and that you want to implement them on a technology platform that provides one or more of the following:
  1. Data modeling (entities, attributes, hierarchies, etc.)
  2. Data browsing
  3. Manual data maintenance (add, edit, delete) via a modern UI
  4. Bulk import and export
  5. Data and Model versioning
  6. Audit capabilities (who changed what when)
  7. Business rules for data validation
  8. Workflow based data validation
  9. Ability to integrate (both ways) with other systems
  10. Fine-grained security
It's my experience that building an MDM database is easy. It's the UI, complex data validation, workflow, and integration with other systems that's the tough part. Lucky for me Microsoft now provides a platform that does all these things for me. How well it does them is what I spent a few days trying to figure out. Before I detail what I liked and didn't like about the product I will provide a general overview as well as an overview of the underlying architecture, technology, user interface, and integration points.

General Overview

Purpose

The purpose of MDS is to provide a platform for the creation, maintenance, and integration of high quality master data throughout an organization.


Licensing

MDS comes with the Enterprise and Datacenter editions of SQL Server. It is licensed the same way as SSIS, SSAS, and SSRS in that it can be installed free of any additional license fees on a machine that already has a paid SQL Server license. However, if installing on a server without SQL Server, it is necessary to pay for another license.

Hardware

MDS can only be installed on a 64-bit server.


Architecture

  • All model metadata and data (members) are stored in a SQL Server database.
  • The Master Data Manager web application and web service layer are built on ASP.NET and WCF, respectively, and are hosted by IIS. All web application methods ultimately pass through the WCF layer.
  • An optional Windows service acts as a broker between MDS and SharePoint so that MDS business rules can use SharePoint 2010 workflows (built on Windows Workflow Foundation).
  • A Configuration Manager web application, similar to that of SSRS, allows for such things as enabling the web service and attaching to an MDS database.

Technologies
  • When new entities are created or modified via Master Data Manager, calls are made to the WCF-based web service that ultimately result in the creation or modification of SQL Server tables, views, etc. Unfortunately the tables are created with a “code name” such as tbl_2_10_EN. However, this shouldn’t be an issue, as all data access should be done via auto-generated views, exports, and/or web services. There is also a view that maps the table code names to meaningful entity names.
  • There are a number of CLR-based functions that are used for merging, de-duping, and survivorship based on rules created by the data stewards and/or administrators.
  • Service Broker is used in scenarios where many emails are being sent or large amounts of data are being loaded. This allows for asynchronous processing that frees up users to continue using the web application.
  • Database Mail, which resides in the msdb database, is used to send e-mail alerts.
  • Staging tables are used when importing data into MDS. This allows a load to be completed in a single batch once it has been staged without error.
  • IIS hosts an application named “MDS” by default. It also hosts a WCF service called Service.svc, which is located in the same directory as the application. It is highly recommended that all custom processes use the web service to interact with the database.

User Interface (Master Data Services Manager)

  • Allows administrative users to create data models to hold entity members (the term member is used as in dimensional modeling, referring to a row/entity instance).
  • Allows users to add, edit, and delete members (entity instances).
  • Allows for the creation of hierarchical relationships between members.
  • Allows for the creation of collections of members for non-mutually exclusive groups of members.
  • Copies of entity data and related metadata (models) can be archived to create a separate version of the data.
  • Includes a business rules engine that can be used to enforce data quality as well as assist with data stewardship via workflows. Both data validation/manipulation rules and workflows have the ability to send e-mail alerts.
  • A granular security model can be used to fine-tune a specific user’s access to certain entities, attributes, and even members (rows).

Integration Points

Database

When importing bulk data it is necessary to get that data into existing SQL Server staging tables. This can be done by making calls to out-of-the-box stored procedures and/or writing directly to the staging tables. For scheduled and programmatic imports, custom SSIS packages are recommended. Once these tables are populated you can invoke the import by using the API, web service, or the Master Data Manager user interface.
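
To make this concrete, here is a rough T-SQL sketch of staging a single entity's members the way this section describes. Treat the table, column, and procedure names as assumptions pulled from memory - verify them against your own MDS database - and note that for scheduled loads you would typically generate these inserts from an SSIS package.

-- Illustrative only: staging a leaf member into the MDS member staging table.
-- Table and column names below are assumptions and should be verified in your MDS database.
INSERT INTO mdm.tblStgMember (ModelName, EntityName, MemberType_ID, MemberName, MemberCode)
SELECT 'Product',          -- MDS model name (example)
       'Product',          -- MDS entity name (example)
       1,                  -- assumed to indicate a leaf member
       src.ProductName,
       src.ProductCode
FROM dbo.SourceProduct AS src;  -- hypothetical source table

-- Once staged, the batch is processed from Master Data Manager (Integration Management)
-- or by invoking the staging sweep stored procedure (mdm.udpStagingSweep in 2008 R2, if
-- memory serves - confirm the name and parameters for your installation).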

Web Service

All MDS functionality can be executed via the IIS-hosted, WCF-based web service named Service. It can be used to create a custom UI and/or to integrate MDS with existing or new applications.

.NET API

Although using the web service directly is recommended, MDS provides three .NET assemblies (Microsoft.MasterDataServices.Core, Microsoft.MasterDataServices.Deployment, Microsoft.MasterDataServices.Services). These are typically used for administrative functions such as creating and deploying update and deployment packages and cloning and deleting models.


Review

Installation
I got started with the tool by installing it on a VMware Player virtual machine running Windows Server 2008 R2. The install was relatively easy, although I did have to do some post-install tinkering to get IIS configured correctly to serve the web app and WCF web service. MDS has a configuration utility that feels somewhat similar to the SSRS configuration utility. It was pretty easy to use, although it has an option to enable the web service yet you still have to change the web.config file to actually enable it.

User Interface
The web-based user interface is on the right track but it is unbearably slow. It can take 15-30 seconds to respond to a simple request. I have seen others complain and responses saying that once the application caches some data things should speed up. I have seen some improvement but it's still way too slow. There are also no indicators letting you know that the application is doing something, so many times I found myself wondering if I had even clicked something.

I also found myself trying to remember where to go to do certain tasks. I couldn't quite put my finger on it, but the UI just doesn't seem intuitive and is definitely inconsistent in some areas. For example, many pages have data grids that look like they are for display only, but if you click on a row you are presented with add, edit, and delete buttons. Additionally, some pages require the user to double-click on a cell in a grid to edit the value. It took a while for me to figure this out as there was no indication that this was the expected behavior.

System Administration
The System Administration area of the application is where models, entities, attributes, etc. are created and maintained.

I found it easy enough to create entities and attributes, although there seems to be a lack of data types for attributes. The biggest shortfall for me is the lack of a Boolean data type. The workaround is to create a domain-based attribute with values of yes/no or true/false. Seems kind of lame to me.

There are also the concepts of derived hierarchies versus explicit hierarchies, as well as collections. It was unclear to me when to use which feature.
Attribute groups are a nice feature that lets you group related attributes, such as address attributes, onto separate tabs for end users to maintain.

The business rules functionality is pretty nice. It lets you define complex, row-level validation and data maintenance but doesn't allow you to access any aggregate or related values in your logic. One of the actions that can be triggered is a workflow - specifically, a SharePoint 2010 WF-based workflow. My feeling is that there's not much special here, as you have to develop a custom WF component and deploy it to your SharePoint server to make it work. All MDS provides is the ability to call this workflow, which can be done from just about any application. It falls short for me.


Conclusion
As much as I want a free tool to ease my workload, MDS just doesn't feel ready for prime time. That said, I'm on the fence as to whether I would use it in an environment that had no other tool and could install it on a machine for which a SQL Server license was already paid. It should be noted that a new version will be released with SQL Server 2012 and promises to be a substantial improvement over the current version.

Generate Insert Statements for SQL Server

Many times it is necessary to script the insert of rows into a SQL Server table as part of a release, bug fix, maintenance process, etc. I have done this through T-SQL, Excel, and I'm sure a few other ways.

It turns out Microsoft provides a tool to do this for us with ease. It is called the Database Publishing Wizard. This tool can script both DDL and DML. The actual executable is named SqlPubWiz.exe and can be found in a path similar to C:\Program Files (x86)\Microsoft SQL Server\90\Tools\Publishing\1.4.

Give it a shot. It's a nice little tool to have in the toolbox.

BIDS (Visual Studio 2008 Shell) with TFS 2010

BIDS is not compatible with TFS 2010. It has to do with TFS 2010 having the concept of Project Collections whereas 2005 and 2008 do not. When adding a Team Foundation Server to Visual Studio Team Explorer 2008 (Tools -> Connect to a Team Foundation Server -> Servers -> Add) there is no way to reference a TFS Project Collection. However, it is possible to get your Visual Studio 2008/BIDS environment up to speed.

There are four things you need. 
1) Visual Studio Team System 2008 Team Explorer downloadable here.
2) Visual Studio 2008 SP1 downloadable here.
3)  Visual Studio Team System 2008 SP1 Forward Compatibility Update for Team Foundation Server 2010 downloadable here.
4) Lastly you'll need to create a registry entry for each Server/Collection you want to use. Launch the registry editor (regedit at a command prompt) and create an entry in HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\9.0\TeamFoundation\Servers similar to below. 

    Value name: ServerName
    Value data: http://ServerName:PortNumber/tfs/CollectionName

Wednesday, December 14, 2011

SSDT vs. VSDB

Microsoft strikes again by replacing a technology that was just beginning to gain a foothold with a product that has some nice improvements but does not replace all of the existing functionality. Here is a link to a long awaited feature comparison between the SQL Server Data Tools SQL Server database project (SSDT) and its predecessor Visual Studio Database project (VSDB).

Dimensional Modeling Skills Assessment

As everyone knows it's extremely hard to hire good people. It's equally hard, if not harder, to actually identify good people. I have previous posts with interview questions that dive into the details of T-SQL, SSRS, SSIS, SSAS, modeling, etc. Unfortunately many of these questions can be answered by those that have done a lot of reading and memorizing. It's my belief that the best technical interviews uncover a candidate's thought process, problem solving skills, and past experience. I recently put together a practical exam that attempts to do this. While I have only administered it once at this point, I am confident that it is a good start to a practical skills assessment, which I think should be part of any developer interview.

The assessment consists of business requirements, modeling requirements, an OLTP normalized, physical data model and some questions. The candidate is asked to review the requirements and data model and create a dimensional model that will serve as the basis for a data warehouse. There are also some follow-up questions.

The current version of the assessment asks for a dimensional model but has some questions that cross over into the physical world and assumes a SQL Server environment. It can probably be improved or at least made applicable to more organizations by making it more generic. However the position I created it for requires some knowledge of SQL Server so I incorporated it into the test.

Below is the problem statement and the physical OLTP data model. You can download the entire assessment here. Please e-mail me at david@elish.net if you'd like the answers.


You are a consultant and have been engaged by a manufacturing company to develop a business intelligence solution to help them reduce their raw material costs. Understanding the Purchase Order process which includes both purchases and receipts against those purchases is key to any potential cost reduction program.
The project’s sponsor has requested a first iteration of a Kimball-style dimensional model representing the material acquisition process. This model will be the basis of a star-schema based data warehouse.  It is your task to develop this model. 


Wednesday, November 16, 2011

SQL Server 2012 Editions

Microsoft has changed SQL Server licensing for the SQL Server 2012 release to better align with how customers are deploying applications and solutions. There will be Enterprise, Business Intelligence, and Standard editions. Below is directly from Microsoft's website.

Standard Edition still includes SSRS and SSAS but does not include the more self-service oriented BI features such as Power View (project Crescent) and Power Pivot for SharePoint Server. Data Quality Services and Master Data Services are also only available with Enterprise Edition.

Tuesday, November 15, 2011

The Excel Connection Manager is not supported in the 64-bit version of SSIS

When attempting to execute a package that makes use of an Excel Connection Manager you may be presented with the following errors.
  • SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER. The AcquireConnection method call to the connection manager "YourConnectionMgrName" failed with error code 0xC00F9304. There may be error messages posted before this with more information on why the AcquireConnection method call failed.
  • [Connection manager "YourConnectionMgrName"] Error: SSIS Error Code DTS_E_OLEDB_EXCEL_NOT_SUPPORTED: The Excel Connection Manager is not supported in the 64-bit version of SSIS, as no OLE DB provider is available.
The problem is that Microsoft doesn't provide a 64-bit Excel Connection Manager component (as stated in the error). This means the package must be run using the 32-bit DTExec.exe. When running in BIDS (Visual Studio) it's as simple as setting the project's Run64BitRuntime property. Below is the default property on a 64-bit machine. Change True to False to run the package.

If you're running the package under a SQL Server Agent job you must also set a similar property on the job step that runs the package. Notice the Use 32 bit runtime property. Make sure this is checked.

Overwrite Excel File in SSIS: Workaround for Excel Connection Manager

I recently needed to create a process that would ultimately create an Excel file for import into another system. The other system was built and maintained by a third party and I had no control over it. It required that an Excel file be present in a specific directory with a specified name. The solution required that 1) if the file did not already exist it would be created and 2) if the file did already exist that it would be overwritten and a copy of the new file would be written to an archive folder with the date and time appended to the file name (so that the other historical files wouldn't be overwritten and so that it would be easy to know what date and time the file was created).

Current File Name: Extract.xls

Historical File Name: Extract_YYYYMMDD_HHMM.xls

I figured this would be easy. The first thing I did was look for an OverwriteFile or ReplaceFile property. What I found was that the only non-connection-specific property the Excel Connection Manager exposes is FirstRowHasColumnName. At that point I figured I would experiment and find out what would happen if I tried to write to a pre-existing file. The result was that the data being written to the file was simply appended to the pre-existing data in the file. Definitely not the behavior I was looking for.

I then hit "the Google" as George W. so fondly calls it. I found solutions that proposed deleting data from worksheet(s), deleting worksheets, and creating new worksheets using the Jet 4.0 SQL syntax to "DROP" and "CREATE" tables. In Jet/Excel terms, a table is an Excel worksheet. I found that the DROP didn't actually delete the worksheet but deleted rows from the worksheet. This could have worked except that the next time the worksheet is written to, the new rows start after the last row of the previously deleted data, leaving a bunch of empty rows in the beginning of the worksheet. This approach did not cut the mustard. So ended my search for the wheel I was attempting to not reinvent.

The approach I came up with does the following.

  1. Creates a shell (headers only) Excel file with the name Extract_wip.xls.
  2. Uses a Data Flow to populate the wip Excel file.
  3. Copies the wip file to an Archive folder, renaming it to Extract_YYYYMMDD_HHMM.xls.
  4. Renames the wip file to Extract.xls, overwriting the previously created Extract.xls file (if one exists).
This is a screen shot of the Control Flow.


*If you don't have an archival/historical requirement you can remove that functionality and make this solution even simpler.


Connection Managers

ExcelFile
The Connection Manager used in the final step of the process to rename the wip file to the final file name. Nothing fancy here although I would suggest making this configurable.

ExcelFileArchive
The Connection Manager used to copy the wip file to an archive folder with the date and time in the file name. Below is the expression used to generate the file name. You will want to set the Connection Manager's ConnectionString expression to this.

"C:\\ExcelExtract\\Archive\\Extract_" + RIGHT( "0000" + (DT_STR, 4 , 1252) DATEPART( "year" , GETDATE() ) , 4 ) + RIGHT( "00" + (DT_STR, 2 , 1252) DATEPART( "month" , GETDATE() ) , 2 ) + RIGHT( "00" + (DT_STR, 2 , 1252) DATEPART( "day" , GETDATE() ) , 2 ) + "_" + RIGHT( "00" + (DT_STR, 2 , 1252) DATEPART( "hour" , GETDATE() ) , 2 ) + RIGHT( "00" + (DT_STR, 2 , 1252) DATEPART( "minute" , GETDATE() ) , 2 ) + ".xls"

ExcelFileForDataFlow
This Connection Manager should point to the wip file. You'll want to have a ConnectionString expression similar to the following.

"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\ExcelExtract\Extract_wip.xls;Extended Properties=\"EXCEL 8.0;HDR=YES\";"

*HDR is whether the file has a header row.

ExcelFileWIP
This Connection Manager points to the wip file and is used to copy and rename the populated Excel file.

SomeOLEDBDataSource
This is used by the data flow. There could be one or more Connection Managers for your data flow.


Control Flow Tasks





Connection Type: EXCEL
Connection: ExcelFileForDataFlow
SQL Statement:

CREATE TABLE [ThisWillBeYourWorksheetName](
[SomeCharVal1] nvarchar(25),
[SomeCharVal2] nvarchar(255),
[SomeCharVal3] nvarchar(50),
[SomeDecimalVal1] double precision,
[SomeDecimalVal2] double precision,
[SomeIntegerVal1] integer)

*It is important that you use the JET syntax including the appropriate data types when writing your CREATE TABLE script.

This task will create a new Excel file with a worksheet named using the name in your CREATE TABLE script. It will use the file name found in the ExcelFileForDataFlow Connection Manager.

*If a file by the same name exists it will add a worksheet to that file. This is why we want to make sure this file does not exist when we start our process. Hence the wip file.





This is the Data Flow that writes to your Excel File. It doesn't matter what the data flow does, just make sure that it writes to an Excel Destination that uses the ExcelFileForDataFlow Connection Manager.





Destination Connection: ExcelFileArchive
OverwriteDestination: True
Operation: Copy File
Source Connection: ExcelFileWIP

This File System Task copies the wip file to an Archive (or any other) folder using the Extract_YYYYMMDD_HHMM.xls naming convention.






Destination Connection: ExcelFile
OverwriteDestination: True
Operation: Rename File
Source Connection: ExcelFileWIP

This File System Task renames the wip file to the Extract.xls name, effectively overwriting the previously created Extract.xls file (if one exists).


Summary

While it seems like it would make sense for the Excel Connection Manager to allow for the overwriting of an existing file, it doesn't. The process above is the simplest way I have found to work around this missing functionality.



Monday, October 31, 2011

SQL Server Default Backup Location


SQL Server makes it easy to set a server's default data and log locations. In SQL Server Management Studio, simply right click on the top level server node in the Object Explorer and select Properties. Then click on the Database Settings page selector on the left side of the screen. At the bottom of this screen there are text boxes for entering the default Data and Log locations.






Unfortunately it's not as easy to set the default backup location. To do this you need to modify the server's registry. Open the Registry Editor by typing regedit either on a command line or in the Search programs and files box at the bottom of the Windows Start Menu. Once you're in the Registry Editor, navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\YourInstanceName\MSSQLServer and locate the BackupDirectory value. Set its data to the directory you'd like to be your default, then close the Registry Editor.
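
If you'd rather script the change than edit the registry by hand, the undocumented (but widely used) xp_instance_regwrite extended stored procedure can set the same value. A minimal sketch, assuming a backup folder of E:\SQLBackups (swap in your own path):

-- Write the default backup location for the current instance.
-- xp_instance_regwrite maps the generic path below to the instance-specific registry key.
EXEC master.dbo.xp_instance_regwrite
     N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'BackupDirectory',
     REG_SZ,
     N'E:\SQLBackups';  -- hypothetical path

-- Read the value back to confirm the change.
EXEC master.dbo.xp_instance_regread
     N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'BackupDirectory';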



The next time you attempt to backup or restore a database you will be presented with your new default backup location. No more navigating to a different location over and over again.


Friday, October 21, 2011

The Future of Visual Studio Database Projects

It is with great sadness that I relay the decision by Microsoft to do away with the Visual Studio Database Project type that I love so much. This project type has given database developers the tools needed to implement high quality Database Development Life Cycle processes. Some highlights can be found here.

However, all is not lost. VSDB will be replaced by what was until yesterday referred to as project Juneau and is now officially named SQL Server Data Tools. It will also be hosted in the Visual Studio environment.


SSDT has many similarities with VSDB but is a complete rewrite from what I understand. Below are some of the features previously unavailable in VSDB.
  1. Code navigation (Go to Definition, Find All References)
  2. Enhanced code analysis and design time validation
  3. Table designer similar to SSMS
  4. Connected (live database) and disconnected (declarative) development
  5. LocalDB - SQLExpress without the need to explicitly install any SQL Server instances
  6. Entity Framework integration for a much more tightly integrated development experience between the data and application layers
  7. Support for CLR objects
  8. Support for SQL Server 2012 and SQL Azure databases (both support 2005, 2008, and 2008 R2)
  9. Integration of SSIS, SSRS, and SSAS projects into the next edition of Visual Studio (currently Visual Studio 2011)
In my opinion, one of the biggest shortcomings of SSDT (as of SSDT CTP3) is that there is no longer support for database unit tests. Automated unit tests have a huge impact on quality and I believe are key to a robust DDLC.

While I am sad to see VSDB go, I am confident that Microsoft will create an even better tool in SSDT. I look forward to exploring it and sharing my findings in future posts.

Wednesday, October 12, 2011

SSIS Configuration File (.dtsconfig) Behavior (relative path issue)

Let me start by saying that I prefer SQL Server configurations over all other configuration types in SSIS for the majority of configurable values in my SSIS projects because they are more secure, easier to decipher, and fit well into database deployment methods and tools that I use.
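
For reference, a SQL Server configuration is nothing more than a row in a table. Below is a sketch of the default table the Package Configurations wizard generates (named [SSIS Configurations], if I recall the defaults correctly), along with a query I find handy for reviewing configured connection strings:

-- Default table generated by the SSIS Package Configurations wizard (verify in your environment).
CREATE TABLE dbo.[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL,  -- groups related settings, typically per application or environment
    ConfiguredValue     NVARCHAR(255) NULL,      -- the value applied at runtime
    PackagePath         NVARCHAR(255) NOT NULL,  -- e.g. \Package.Connections[Source].Properties[ConnectionString]
    ConfiguredValueType NVARCHAR(20)  NOT NULL   -- e.g. String, Int32
);

-- Review every configured connection string by filter.
SELECT ConfigurationFilter, PackagePath, ConfiguredValue
FROM dbo.[SSIS Configurations]
WHERE PackagePath LIKE '%ConnectionString%'
ORDER BY ConfigurationFilter;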

Unfortunately, not all configurations can be of the SQL Server configuration type as we need an initial configuration that stores the connection string for the database that the SQL Server configurations live in.

This is where it becomes necessary to use an XML configuration file type of SSIS configuration. One of the problems with using an XML config file is that SSIS does not understand relative paths. This means that all environments must store the config file in the exact same directory structure, otherwise the file will not be found (except in some circumstances which I'll explain later in this post).

To get around this problem we can use something called an Indirect XML configuration file configuration type to store this single configuration containing the connection string for the SSIS configuration database. See the first configuration in the screen shot below.


The beauty of this configuration type is that it allows you to use an XML config file in such a way that its location is not environment specific. It does this by getting the location of the XML config file from a Windows Environment Variable. Unfortunately, this Environment Variable must be created in environments where SSIS packages are being developed (where BIDS is installed) and in runtime environments where the config file does not live in the same directory as the SSIS packages being executed. While this does require added effort, it only has to be done once per environment. Below is a screen shot of the creation of a sample Environment Variable.


If you store your XML configuration file (.dtsconfig) in the same directory as the rest of your SSIS packages (.dtsx) then it is not necessary to create this environment variable on machines where your packages will be executed by the SSIS runtime and/or dtexec/dtexecui. SSIS will attempt to find the Environment Variable and when it cannot, it will then look in the root path of the SSIS packages and use the config file found there.

However, if your design requires that you store your XML config file in a location other than your package location you must create the Environment Variable in all locations.

At the risk of repeating myself, BIDS will not be able to find your config file unless you hard-code its absolute path in a regular XML configuration file configuration (non-Indirect) or create an Indirect XML configuration using an Environment Variable. This is true even if the config file lives in the same directory as your packages. Hard-coding works fine when there is only one developer and that developer uses only one machine to develop. Once you have multiple development team members with potentially different directory structures, you can avoid many headaches by making your XML config file location environment agnostic. Don't fear the Environment Variable!

Monday, September 12, 2011

OLE DB Destination or SQL Server Destination

SSIS offers a number of Data Flow destination components. When working with SQL Server databases as a destination there are two obvious choices.



OLE DB Destination




SQL Server Destination


Microsoft claims up to a 15% performance increase by using the SQL Server destination. Therefore it seems like a good choice. However, there is one rather large caveat. SSIS must be running on the same SQL Server as the destination database. It cannot write to a remote SQL Server.

That said, I have seen many solutions that have both SSIS and the destination database on the same server so it seems to make sense to use it in these situations. Personally, I choose not to. By using the SQL Server destination you're locking yourself into having to make a code change if you decide to split your SSIS and database instances across multiple servers. Having SSIS code dependent on a solution's hardware architecture is something I always try to avoid.

Wednesday, June 22, 2011

SQL Bulk Copy

In an earlier post I talked about a generic SSIS package for staging data using SQL Bulk Copy. I have been using it extensively without fail . . . until just recently.

I got a new set of files from a new business unit that we were bringing into our data warehouse solution. Of their 20 files 14 loaded without issue and the remaining 6 failed with an error similar to the one below.


System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.InvalidOperationException: The given value of type String from the data source cannot be converted to type int of the specified target column.


I kept going back to the developers that created the extract process and telling them that they were doing something wrong with their encoding or delimiters, but neither we nor they could actually find the issue. Until now.

Turns out the issue is related to SQL Bulk Copy, numeric columns, and null values. Apparently SQL Bulk Copy does not, by default, keep nulls. Instead, it converts them to empty strings. Well, as I'm sure you know, SQL Server doesn't like it when you try to load empty strings, or any strings for that matter, into a numerically typed column.

There are a couple of solutions.

  1. Make sure to replace all nulls with zeros when creating flat files with numeric data that will be consumed by BCP.
  2. Use the KeepNulls option when executing BCP. In my SSIS solution I changed my SqlBulkCopy object creation to look like this . . .
Dim bc As SqlBulkCopy = New SqlBulkCopy(dbConn, SqlBulkCopyOptions.KeepNulls, bulkCopyTransaction)

Tuesday, June 14, 2011

Data Warehouse QA Technical Screen Questions

I was recently asked to perform a technical phone screen of a data warehouse QA candidate. The project manager wanted to ensure that this person knew enough SQL to be able to efficiently test our data warehouse contents (not end-to-end processing testing).

So that I didn't sound like a bumbling fool when I made the call I put together a list of questions. It's by no means exhaustive but I figured it could be helpful to some . . .


When was the last time you used SQL on a day to day basis (how often do they use it and how rusty might they be)?

How would you rate your SQL from 1-10 (sets expectations for how well they will answer questions below)?

What database platforms/versions have you used (Oracle, SQL Server, MS Access, MySQL, etc.)?

1) What are the two keywords necessary for any SQL statement?
A1) SELECT
A2) FROM

2) How do you limit the results of a query based on one or more criteria?
A1) WHERE clause

3) What is a join?
A1) It combines records from two or more tables.

4) What types of joins are there and what do they do?
A1) INNER JOIN
A2) LEFT OUTER JOIN
A3) RIGHT OUTER JOIN
A4) CROSS JOIN
A5) FULL OUTER JOIN

5) What is the syntax for joining two tables?
A1) FROM tableA
INNER JOIN tableB ON tableA.column1 = tableB.column2

6) How do you get a single instance of each value of a column in a table?
A1) DISTINCT

7) How do you get the number of occurrences of each distinct value in a table?
A1) COUNT, GROUP BY

8) How do you get the largest value of a column in a table?
A1) MAX

9) What is the difference between putting criteria in a WHERE clause or a JOIN clause?
A1) When it is an INNER JOIN it does not affect the results.
A2) When it is an OUTER JOIN, putting the criteria in the WHERE clause effectively makes it an INNER JOIN, which is not usually the intended behavior.
A3) When it is an OUTER JOIN, putting the criteria in the JOIN clause allows the criteria to be applied only to the outer-joined rows and does not create the INNER JOIN effect (see the sketch below).
A4) Depending on the database platform and version, the performance could be affected.

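A quick sketch of A2/A3, using hypothetical Customer and Orders tables:

-- Criteria in the JOIN clause: customers with no 2011 orders are still returned (order columns are NULL).
SELECT c.CustomerID, o.OrderID
FROM Customer c
LEFT OUTER JOIN Orders o
    ON o.CustomerID = c.CustomerID
    AND o.OrderYear = 2011;

-- Criteria in the WHERE clause: rows where o.OrderYear is NULL are filtered out,
-- so the query behaves like an INNER JOIN.
SELECT c.CustomerID, o.OrderID
FROM Customer c
LEFT OUTER JOIN Orders o
    ON o.CustomerID = c.CustomerID
WHERE o.OrderYear = 2011;
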
10) What is the difference between a WHERE clause and a HAVING clause?
A1) The WHERE clause applies the criteria at the row level whereas the HAVING clause applies it at the aggregate level.

11) What is a derived table/inline view?
A1) It is a query that is aliased as a table and used like any other table in a FROM clause.

12) What is a nested subquery?
A1) It is a query that is nested within a SELECT or a WHERE clause.

13) What is a correlated subquery and when is it useful?
A1) It is a query that is used within a parent query and refers to the parent query.
A2) A correlated subquery is evaluated once for each row processed by the parent statement.
A3) They are useful in providing a set of values to be used in a WHERE clause.
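
A small example of a correlated subquery, using a hypothetical Employee table:

-- Employees paid above their department's average. The inner query references
-- e.DepartmentID from the outer query, so it is evaluated once per outer row.
SELECT e.EmployeeID, e.Salary
FROM Employee e
WHERE e.Salary >
    (SELECT AVG(e2.Salary)
     FROM Employee e2
     WHERE e2.DepartmentID = e.DepartmentID);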


Our data warehouse solution consists of the following.
  1) Flat files that provide all source data
  2) A staging database that the flat files are loaded into
  3) A data warehouse database that contains the transformed data based on a documented set of rules.

1) If you could assume that the flat files were loaded to the staging database without any errors, how would you test that the data was correctly transformed from the staging database to the data warehouse? What tools, if any, would you need?
A1) Write queries against the staging database that follow the transformation rules. Then write queries to pull the related rows from the data warehouse and compare the results.
A2) Some kind of a query execution tool is necessary. Toad, RapidSQL, SQL Server Management Studio, etc.

2) How do you test large sets of data where there could be millions of rows?
A1) Narrow down the results based on specific criteria.
A2) Apply aggregate functions such as SUM to determine if the test query aggregates to the same value as the data warehouse query.
A3) Join the test query/tables to the data warehouse query/tables and apply a WHERE clause that filters on only those values that do not equal each other (sketched below).
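
A rough sketch of A2 and A3, using hypothetical stg.Sales (staging) and dw.FactSales (warehouse) tables:

-- A2: compare aggregates rather than eyeballing millions of rows.
SELECT (SELECT SUM(SaleAmount) FROM stg.Sales)    AS StagingTotal,
       (SELECT SUM(SaleAmount) FROM dw.FactSales) AS WarehouseTotal;

-- A3: join the expected (test) result to the warehouse and keep only the mismatches.
SELECT s.SaleID, s.SaleAmount AS ExpectedAmount, f.SaleAmount AS ActualAmount
FROM stg.Sales s
LEFT OUTER JOIN dw.FactSales f ON f.SaleID = s.SaleID
WHERE f.SaleID IS NULL                -- rows missing from the warehouse
   OR f.SaleAmount <> s.SaleAmount;   -- rows with differing values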

3) What data would you provide to the development team when you found inconsistencies between the documented transformations and what was in the data warehouse?
A1) The transformation rule
A2) The test query
A3) The data warehouse query (if independent from above)
A4) Explanation of expected versus actual results

Monday, March 28, 2011

A One-Package, Generic SSIS Staging Process


I recently worked on a project that was flat file and staging intensive. By intensive I mean we had essentially 1.5 ETL developers and over 50 flat files to stage. We also had a very aggressive timeline. Two data marts with source data coming from 5 totally disparate and disconnected (hence the flat files) subsidiaries/source systems all over the world (think currency conversion, translation, master data management, etc.) and only three and a half months to get it done.

Was I going to create a package to stage each flat file? Ummm . . . no! So what was I to do? I was going to create a single, metadata driven package that would loop through a data set and populate my staging tables one by one. Below is a screenshot of all the components necessary to make this work. There is no Data Flow.


Before going into the details there are a few things to keep in mind.

  1. This solution is based on text files. It can easily be tailored to read from relational sources, and to non-relational sources with a little more tweaking.
  2. What makes this design possible is a Script Task that alleviates the need for custom data flows for each extract file/table to be staged.
  3. The solution assumes that the staging tables can hold more than one day's/load's worth of data and therefore adds an ExtractFileID column to each staging table and to the source data as it is being loaded.
  4. For the script task to work as-is it is necessary to create a staging table for each data source that will be staged (an example is sketched after this list). Each staging table must meet the following criteria:
    • The staging table must contain column names that match the data source column names exactly.
    • The staging table columns must have data types that are appropriate for the incoming data. If they are character data types they must be large enough that they will not truncate any incoming data.
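
Here is a hedged example of what one of these staging tables might look like, assuming a hypothetical Customer extract with CustomerCode, CustomerName, and CreditLimit columns:

-- Hypothetical staging table for a Customer extract. Column names must match the
-- extract file's header row exactly; character columns are sized generously.
CREATE TABLE stg.Customer
(
    CustomerCode  NVARCHAR(50)  NULL,
    CustomerName  NVARCHAR(255) NULL,
    CreditLimit   NVARCHAR(50)  NULL,  -- staged as text and converted/validated downstream
    ExtractFileID INT           NULL   -- populated by the Script Task to tag each load
);
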
The metadata that drives the solution might be specific to your project so I won't go into it in this post. What you'll need to do is create the metadata tables necessary to produce a data set that can be used by a Foreach Loop container, which will loop through each file or table that needs to be staged.

However, you can test the Script Task by hard-coding a single file's metadata within the Script Task. That is exactly what I did to create the script. Once the script was working as expected I created the necessary metadata tables and data and passed the values into the script to make it more dynamic. Below is the method that does the bulk of the work (no pun intended). The entire script can be downloaded here.
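
As a starting point, here is a hedged sketch of a metadata table whose columns mirror the parameters of the BulkCopyDelimitedFile method shown below, along with the query a Foreach Loop container's source data set could be built from (the table and column names are just examples):

-- Hypothetical metadata table driving the staging loop.
CREATE TABLE meta.ExtractFile
(
    ExtractFileID       INT IDENTITY(1,1) PRIMARY KEY,
    ExtractFileFullPath NVARCHAR(500) NOT NULL,  -- e.g. 'C:\Extracts\Customer.txt'
    StageTableName      NVARCHAR(128) NOT NULL,  -- e.g. 'stg.Customer'
    ColumnDelimiter     NVARCHAR(5)   NOT NULL,  -- e.g. '|'
    BatchSize           INT           NOT NULL,  -- rows per bulk copy batch
    IsActive            BIT           NOT NULL DEFAULT 1
);

-- Result set returned by an Execute SQL Task and iterated by the Foreach Loop Container.
SELECT ExtractFileID, ExtractFileFullPath, StageTableName, ColumnDelimiter, BatchSize
FROM meta.ExtractFile
WHERE IsActive = 1;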



Private Sub BulkCopyDelimitedFile(ByVal batchSize As Integer, _
            ByVal columnDelimiter As String, _
            ByVal extractFileID As Int32, _
            ByVal extractFileFullPath As String, _
            ByVal stageTableName As String)

        Dim SqlConnectionManager As Microsoft.SqlServer.Dts.Runtime.ConnectionManager

        'Set a connection manager object equal to the connection manager named "Stage"
        SqlConnectionManager = Dts.Connections("Stage")

        'Since the "Stage" connection manager is of type OLEDB we need to modify it to make it compatible with an ADO.NET connection
        Dim dbConn As SqlConnection = New SqlConnection(SqlConnectionManager.ConnectionString.Replace("Provider=SQLNCLI10.1;", ""))

        'Create a new StreamReader object and pass in the path of the file to be read in
        Dim sr As StreamReader = New StreamReader(extractFileFullPath)

        'Read in the first line of the file
        Dim line As String = sr.ReadLine()

        'Create an array of strings and fill it with each column within the extract file by using the Split function
        Dim strArray As String() = line.Split(columnDelimiter)

        'Create a DataTable object
        Dim dt As DataTable = New DataTable()

        'Create a DataRow object
        Dim row As DataRow

        'Open the database connection
        dbConn.Open()

        'Create a SQLTransaction object in which to execute the Bulk Copy process so that if it fails, all writes are rolled back
        Dim bulkCopyTransaction As SqlTransaction = dbConn.BeginTransaction()

        'Instantiate our SqlBulkCopy object and set appropriate properties
        Dim bc As SqlBulkCopy = New SqlBulkCopy(dbConn, SqlBulkCopyOptions.Default, bulkCopyTransaction)

        'Set the BulkCopy destination table name equal to the staging table name provided by the Foreach Loop Container
        bc.DestinationTableName = stageTableName

        'Set the BulkCopy batch size equal to the size contained in the metadata provided by the Foreach Loop Container
        bc.BatchSize = batchSize

        'For each column that was found in the extract file
        For Each columnName As String In strArray
            'Add a column in the data table
            dt.Columns.Add(New DataColumn(columnName))

            'Add a column mapping in the SqlBulkCopy object
            bc.ColumnMappings.Add(columnName, columnName)
        Next

        'Add the ExtractFileID column to the data table since it doesn't exist in the extract file and wasn't added in the previous For Each loop
        dt.Columns.Add(New DataColumn("ExtractFileID", System.Type.GetType("System.String")))

        'Add the ExtractFileID column mapping since it doesn't exist in the extract file and wasn't added in the previous For Each loop
        bc.ColumnMappings.Add("ExtractFileID", "ExtractFileID")

        'Move to the first row in the extract file after the header
        line = sr.ReadLine()

        Dim rowCount As Integer

        'Loop through all rows in the extract file
        While Not line = String.Empty
            rowCount += 1

            'Create a new data table row to store the extract file's data in
            row = dt.NewRow()

            'Add all column values to the row
            row.ItemArray = line.Split(columnDelimiter)

            'Add the ExtractFileId 
            row("ExtractFileID") = extractFileID

            'Add the newly created row to the data table
            dt.Rows.Add(row)

            'Move to the next row in the extract file
            line = sr.ReadLine()
        End While
        

        Try
            'Write the data table to the staging table
            bc.WriteToServer(dt)

            'If successful, commit the transaction
            bulkCopyTransaction.Commit()
        Catch ex As Exception
            'If an exception occurs, rollback the transaction
            bulkCopyTransaction.Rollback()

            'Set the Task result to Failure
            Dts.TaskResult = ScriptResults.Failure


            Throw ex
        End Try

        'Close the database connection
        dbConn.Close()

        'Close the BulkCopy object
        bc.Close()

        'Set the SSIS metadata variable equal to the number of rows loaded
        Dts.Variables("ExtractActualRowCount").Value = rowCount
    End Sub


Download script