Tuesday, August 25, 2009

Top Clause - Using Dynamic Values

Prior to SQL Server 2005, the value in the TOP clause was limited to literal values which meant that variables could not be used. Hence, the number of rows returned could not be variable/dynamic. Below is an example:

SELECT TOP 10 * FROM table1

New to SQL Server 2005 is the ability to use a variable in the TOP clause.

DECLARE @top INT
SET @top = 10

SELECT TOP (@top) *
FROM table1

This is pretty useful when there is a requirement to have a user determine the number of rows returned in a result set. Note that the variable must be enclosed in parentheses.
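As a quick sketch (table, column, and variable names are illustrative), the parenthesized form accepts any expression, and the same syntax works with PERCENT:

```sql
DECLARE @top INT
SET @top = 5

-- Any integer expression is allowed inside the parentheses
SELECT TOP (@top * 2) *
FROM table1
ORDER BY col1

-- TOP (...) PERCENT also takes a variable (a float expression)
DECLARE @pct FLOAT
SET @pct = 10
SELECT TOP (@pct) PERCENT *
FROM table1
ORDER BY col1
```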

Monday, August 24, 2009

Great Book for Building a SQL Server BI Solution

While written for SQL Server 2005 and a little out of date for SQL Server 2008, The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset is a must-read for anyone looking to learn and/or apply the Kimball methodology of dimensional modeling while using the Microsoft SQL Server BI stack (SSIS, SSRS, SSAS).

The book starts by describing what the Kimball Group has determined to be the best way to execute the requirements gathering phase. It then moves on to designing the dimensional model (including the physical relational data warehouse), creating the ETL infrastructure (SSIS), building an OLAP database (SSAS), creating standard reports (SSRS), and dealing with partitioning, security, maintenance, and more.

Someone with a few years of SQL Server experience can build a full-blown, mid-sized BI solution with this book alone.

The book also has a great, although not physically attractive, companion website. Check out the tools section for some invaluable tools including a dimensional modeling spreadsheet that is perfect for the first few iterations of the relational data warehouse. While not extremely fancy it will generate quality "create" scripts for the initial development phase of your project.

Interview Questions

I recently conducted technical interviews for one of my clients to help find a mid-level SQL Server BI developer. Below are the questions I came up with for what tended to be 30-60 minute telephone interviews. The answers were written in such a way that a non-technical person could follow along and gauge whether or not the candidate was on the right track.

Download the questions in a Word document.

Data warehousing/Dimensional Modeling
1. Q: What is the Kimball Methodology/Dimensional Modeling?
A: Dimensional modeling always uses the concepts of facts, AKA 'measures', and dimensions. Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.

2. Q: What is meant by level of granularity in a fact table?
A: At what level of detail are we storing data? We can store sales data by store by day by product or by day by region by product or by receipt item. Receipt item level would be the finest level of grain and is the favored approach assuming there are the resources (disk, processing power, etc.) to support it.
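As an illustrative sketch (all table and column names are hypothetical), a fact table declared at receipt-item grain might look like this:

```sql
-- One row per line item on a receipt: the finest grain available
CREATE TABLE FactSalesLineItem (
    DateKey       INT         NOT NULL,  -- FK to DimDate
    StoreKey      INT         NOT NULL,  -- FK to DimStore
    ProductKey    INT         NOT NULL,  -- FK to DimProduct
    ReceiptNumber VARCHAR(20) NOT NULL,  -- degenerate dimension
    LineNumber    SMALLINT    NOT NULL,
    Quantity      INT         NOT NULL,
    SalesAmount   MONEY       NOT NULL
)
```

Store-by-day or region-by-day grains can always be derived from this table by aggregation, but not the reverse.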

3. Q: Explain the concept of Conformed Dimensions.
A: At the most basic level, a conformed dimension means the exact same thing with every possible fact table to which it is joined. For example, in a loyalty program, the same site dimension table can be related to both the earn-reward and burn-reward fact tables. This way, we can consistently look at both the earn and burn data as it relates to site information.

4. Q: What is 1) a degenerate dimension and 2) a junk dimension, and can you provide an example of each?
A1: A degenerate dimension is a single-attribute dimension, typically consisting of a transaction-level identifier, that does not warrant its own relational table or OLAP dimension. Examples are PO number, order number, invoice number, etc.
A2: A junk dimension is a way of organizing miscellaneous, mostly unrelated indicator/flag type attributes that would not otherwise belong to a dimension. It’s a method of reducing the clutter that can result from having many small and/or single attribute dimensions. Junk dimensions typically consist of the Cartesian product of all values. A junk dimension could include such values as order status, new customer indicator, shipping method, etc.
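A minimal sketch of populating such a junk dimension with the Cartesian product of its flag values (attribute names and values are illustrative; the row constructor syntax requires SQL Server 2008):

```sql
CREATE TABLE DimOrderJunk (
    OrderJunkKey    INT IDENTITY(1,1) PRIMARY KEY,
    OrderStatus     VARCHAR(10) NOT NULL,
    NewCustomerFlag CHAR(1)     NOT NULL,
    ShippingMethod  VARCHAR(10) NOT NULL
)

-- Cross join the distinct values of each attribute
INSERT INTO DimOrderJunk (OrderStatus, NewCustomerFlag, ShippingMethod)
SELECT s.v, n.v, m.v
FROM (VALUES ('Open'), ('Shipped'), ('Cancelled')) s(v)
CROSS JOIN (VALUES ('Y'), ('N')) n(v)
CROSS JOIN (VALUES ('Ground'), ('Air')) m(v)
```

The fact table then carries a single OrderJunkKey instead of several small flag columns or single-attribute dimensions.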

5. Q: What is a star-schema and why do we use it for data warehousing?
A: A star schema has a fact table (in 3rd normal form) at the center, surrounded by denormalized (2nd normal form) dimension tables whose primary keys are referenced by foreign keys in the fact table. We use the star schema because queries against it are simple and highly efficient.
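A typical star-join query against such a schema might look like this (table and column names are illustrative):

```sql
SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d    ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category
```

Every join is a single hop from the fact table to a dimension, which is what keeps these queries simple and fast.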

6. Q: What are 1) non-additive facts 2) semi-additive facts and provide an example of each?
A: Non-additive facts are measures in a fact table that cannot be added. Any average would be an example of this. Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. Account balance can be added up across all accounts but not over time.
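To illustrate the semi-additive case with a hypothetical daily snapshot table: balances can be summed across accounts for a single day, but summing daily snapshots over a date range double-counts, so a last-value or average is used across time instead.

```sql
-- Valid: total balance across all accounts on one day
SELECT SUM(Balance) AS TotalBalance
FROM FactAccountSnapshot
WHERE DateKey = 20090825

-- Not valid: summing the same balance once per day over a month
SELECT AccountKey, SUM(Balance)  -- double-counts across time
FROM FactAccountSnapshot
WHERE DateKey BETWEEN 20090801 AND 20090831
GROUP BY AccountKey
```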

7. Q: What is a slowly changing dimension and what are the different types?
A: A slowly changing dimension is one in which the attributes change over time.
Type 1 – Only one record is stored. Old data is overwritten with new data.
Type 2 – Tracks historical data by adding new records each time an attribute changes.
Type 3 – Tracks history via additional columns within a single record.
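A Type 2 change can be sketched as expiring the current row and inserting a new version (illustrative schema with effective-date and current-flag columns):

```sql
-- Expire the current version of the customer row
UPDATE DimCustomer
SET RowEndDate = '2009-08-25', CurrentFlag = 'N'
WHERE CustomerBusinessKey = 'C1001' AND CurrentFlag = 'Y'

-- Insert the new version, which receives a fresh surrogate key
INSERT INTO DimCustomer
    (CustomerBusinessKey, City, RowStartDate, RowEndDate, CurrentFlag)
VALUES ('C1001', 'Chicago', '2009-08-25', '9999-12-31', 'Y')
```

Fact rows loaded after the change reference the new surrogate key, so history is preserved automatically.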

8. Q: What is a surrogate key and why do we use them?
A: It’s a substitute for a natural/business key and is typically an integer. In SQL Server we often use an identity column of type INT or BIGINT. We use them because we can’t rely on the business key: business keys can change or overlap between multiple source systems, and narrow integer keys are the most efficient in terms of joins.
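In SQL Server this is commonly an IDENTITY column (a sketch; table and column names are illustrative):

```sql
CREATE TABLE DimProduct (
    ProductKey         INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    ProductBusinessKey VARCHAR(20)  NOT NULL,          -- natural key from the source system
    ProductName        VARCHAR(100) NOT NULL
)
```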

General Relational Database
9. Q: Have you designed and built OLTP (transactional) and/or OLAP (analytical/reporting) databases. Have you used any data modeling tools? How many tables did these databases consist of?
A: A qualified candidate will have designed and built numerous OLTP and OLAP databases. They might mention ERwin or Rational. While having used a data modeling tool is not very important, if they have, it reinforces the fact that they have performed database design work. They should have designed some databases with 20+ tables.

10. Q: What size databases (MB/GB/TB) have you worked with?
A: They shouldn’t have to think too hard about this. Someone that can’t come up with an answer immediately has probably never “owned” a database. A qualified candidate should have worked with at least a 100GB database. 10GB+ could be acceptable. Not good if they’re talking in MBs.

11. Q: What is 3rd normal form and what is the main benefit of having a normalized database? Why would you want to denormalize?
A: Every non-key column must provide a fact about the key, the whole key, and nothing but the key (all columns are directly dependent on the primary key). Having a normalized database increases data integrity. Denormalization can be used to increase performance.

12. Q: What additional techniques can be used to ensure data integrity?
A: Unique constraints, check constraints, foreign key constraints, cascading updates and deletes

13. Q: What is an outer join? How is it different from an inner join?
A: It’s a join that includes all the rows from one table and only those rows from the other that have corresponding join values. An inner join only includes those rows that have corresponding values in both tables of the join.
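A quick sketch of the difference (hypothetical Customer and Order tables related on CustomerId):

```sql
-- Inner join: only customers that have at least one order
SELECT c.CustomerName, o.OrderId
FROM Customer c
INNER JOIN [Order] o ON o.CustomerId = c.CustomerId

-- Left outer join: every customer; OrderId is NULL where no order exists
SELECT c.CustomerName, o.OrderId
FROM Customer c
LEFT OUTER JOIN [Order] o ON o.CustomerId = c.CustomerId
```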

14. Q: What is a bridge table (a.k.a. join table, link table, junction table, map table)
A: Used to handle many-to-many relationships since most relational database systems only support one-to-many relationships.
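For example, a many-to-many relationship between students and classes can be resolved into two one-to-many relationships with a bridge table (names illustrative):

```sql
CREATE TABLE StudentClass (
    StudentId INT NOT NULL REFERENCES Student(StudentId),
    ClassId   INT NOT NULL REFERENCES Class(ClassId),
    PRIMARY KEY (StudentId, ClassId)
)
```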

Microsoft SQL Server
15. Q: What are the differences between clustered and non-clustered indexes and how many of each can you have on a table?
A: The leaf node of a clustered index contains the actual data pages. It determines the physical order in which the table data is stored. The row locator in a clustered index is the clustered index key. There can be only one clustered index on a table. In a non-clustered index the row locator is a pointer to the row. There can be 249 non-clustered indexes on a table in SQL Server 2005 (999 in SQL Server 2008).

16. Q: What are Included Columns when referring to a non-clustered index in SQL Server and why would you use them?
A: Included Columns are non-key columns added to the leaf level of a non-clustered index. This functionality allows us to store data alongside the index’s key column(s). We use this feature to increase performance by decreasing disk I/O because we don’t need to access the clustered index if all columns requested in a query are present as either key columns or Included Columns. When all columns are present in an index it is referred to as a covering index.
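A covering index with included columns might be created like this (index, table, and column names are illustrative):

```sql
-- The key column supports the seek; the included columns
-- let the index cover the query without a clustered index lookup
CREATE NONCLUSTERED INDEX IX_Order_CustomerId
ON [Order] (CustomerId)
INCLUDE (OrderDate, TotalAmount)
```

A query selecting only CustomerId, OrderDate, and TotalAmount can then be answered entirely from this index.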

17. Q: What tools can be used to performance tune a SQL Server database and how do you use them?
A: Tools include Profiler, Database Engine Tuning Advisor (DETA), Query Plans in SQL Server Management Studio, Dynamic Management Views (DMVs), Windows System Monitor/Performance Counters.
Use Profiler to identify problem queries as well as locking and blocking. Use profiler to generate a trace file to be analyzed by DETA to provide suggested indexes and statistics.

18. Q: What does the ISNULL function do and what is the syntax?
A: When applied to a column, if the value in the current row is NULL it returns the specified replacement value, e.g. ISNULL(mycolumn, 'NO VALUE')

19. Q: How would you create a pivot/crosstab using only SQL?
A: Common table expressions (CTE), the SQL PIVOT operator, CASE statements
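A minimal PIVOT sketch (SQL Server 2005+; table and column names are illustrative):

```sql
-- One row per store, one column per year
SELECT StoreId, [2008], [2009]
FROM (SELECT StoreId, SalesYear, SalesAmount FROM Sales) s
PIVOT (SUM(SalesAmount) FOR SalesYear IN ([2008], [2009])) p
```

The same result can be produced on SQL Server 2000 with CASE statements inside SUM, which is why both answers are acceptable.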

20. Q: Have you had any experience with partitioning or replication and if so, describe?
A: Any exposure to these technologies is useful. No need to dive too deep.

SSIS (SQL Server Integration Services)
21. Q: What are the different methods available for providing configuration information to SSIS packages and which do you consider to be the best and why?
A: XML file, Environment Variable, SQL Server, registry, parent package variable. XML files don’t work for package promotion through environments (dev, test, prod). Preferred approach is to use an environment variable that points to a configuration database and then use SQL Server configuration for all other configuration values.

22. Q: What is your approach to debugging SSIS packages?
A: Inserting breakpoints in the control flow, turning on detailed logging, using Profiler

23. Q: How do you version your SSIS packages?
A: There are version properties/attributes on SSIS packages that can be set by the developer.
Another valid answer would be by using a source control tool.

24. Q: Name the transformations you use most in SSIS.
A: Should be able to quickly name some of the following: lookup, merge join, union all, multicast, derived column, aggregate, OLE DB command, pivot, unpivot, slowly changing dimension

25. Q: How have you deployed packages?
A: Packages can be deployed to SQL Server or the file system using Visual Studio, an installation utility generated by Visual Studio, and by simply copying and pasting files. Generating an install utility and copy/paste are the preferred methods.

SSRS (SQL Server Reporting Services)
26. Q: What kinds of data sources have you used when writing reports and which do you prefer and why?
A: Stored procedures, SQL directly in report, MDX directly in report

27. Q: How do you implement cascading parameters in SSRS? (A cascading parameter is one that drives the values of another dependent parameter e.g. country -> state/province)
A: Include the parent parameter (country) in the WHERE clause of the child parameter’s (state/province) query.
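For example, the child dataset's query would reference the parent report parameter directly (parameter and column names are illustrative):

```sql
SELECT StateProvinceId, StateProvinceName
FROM StateProvince
WHERE CountryId = @Country
ORDER BY StateProvinceName
```

When the user picks a country, SSRS re-runs this query and refreshes the state/province list.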

28. Q: How have you made your SSRS reports available to users?
A: There is an out of the box report portal referred to as Report Manager. This is what Override uses. SharePoint and custom applications can also be used to host reports.

29. Q: Have you had any experience with Report Builder and Report Builder models?
A: Any experience is a plus. If they really have worked with it they’ll probably say that Report Builder was not a great tool.

SSAS (SQL Server Analysis Services)
30. Q: How do you include multiple fact tables in a cube?
A: Multiple measure groups, multiple linked measures/virtual cubes

31. Q: Have you implemented incremental cube processing and if so, how?
A: SQL queries leveraging some sort of a filter to identify only new data.

32. Q: What is a Fact dimension relationship type?
A: It allows for the dimensioning of values found within a fact table. It can be used to expose degenerate dimensions.

Checkpoint Files - Performance Issues w/ Loop Containers

During an interview today I asked a candidate if she had experience using SSIS Checkpoint functionality. Although I haven't confirmed what she told me, I suspect that it is true. She said that she has experienced performance issues related to using Checkpoint files in packages containing loop containers. My guess is that SSIS tracks the value of the variable being used for the loop counter in the checkpoint file. Writing a value to disk for every iteration in a loop probably has very significant performance effects.

Clustered Index Update - Why?

I was just troubleshooting a long-running query that happened to be an UPDATE statement and found that a Clustered Index Update operation was a good portion of the total query execution plan. Unfortunately, the table I'm updating has a composite primary key that is also the clustered index. However, the statement was not updating any columns in the key.

So why would there be a Clustered Index Update operation? Simply put, the clustered index is the table. What I mean by this is that table data is stored alongside the clustered key with this key dictating the order in which the rows are stored. Hence, any update to a table is an update to its clustered index.

Also, keep in mind that a clustered index update that affects the clustered key also has to update all non-clustered indexes, since all non-clustered indexes reference the clustered key. However, this is not applicable to the above scenario, as my UPDATE statement was not updating the clustered index/key.