Sunday, February 28, 2010

Clustered Index Doesn't Have to be the Primary Key

Believe it or not, you don’t have to cluster your primary keys.

Most SQL Server mortals assume that their primary keys should be clustered because that is the default behavior of SQL Server Management Studio when using the designer to create a table. Unfortunately, the primary key is not always the best candidate for the one and only clustered index you're allowed on a table.

I won't go into detail here about what a clustered index is but I'll say that it is essentially your table and the order in which your table rows/records/tuples are stored is based on the column(s) used in your clustered index.

In an OLTP database where your pimary keys consist of a single integer identity column you're usually in good shape if your clustered index is also your primary key.

However, if you use composite and/or "intelligent" keys you're going to experience a number of performance problems. I won't expound on these problems here because I almost never use intelligent or composite keys in an OLTP system and don't believe they should be used in such systems.

The place I most often have a non-primary key clustered index is in an OLAP fact table. OLAP systems are more concerned about high performance reads of blocks of data as opposed to reads and writes of individual rows as is the case in transactional systems.

In a data warehouse/OLAP database date is often an attribute used in most analyses. By creating a clustered index on the most important date in a fact table (i.e. order date in an order table), you enable better I/O and therefore faster query response times for the majority of queries executed.

Some might argue that the unique surrugate key of the fact table (if you even have one) should have a clustered index. Hmmm . . . how often do you query a fact table and have a surrogate key based predicate? The answer should probably be never (although I almost never say never)!

How about writes? Fact tables are often written to based on a date. By clustering on date we typically have sequential writes to the disk. When we load today's data we end up adding data to the end of the last data page for the table because today's date comes after yesterday's date. If we were to cluster on some kind of an intelligent key we'd end up having to insert rows into pre-existing data pages which leads to much higher I/O and therefore decreased performance.

How about archiving and partitioning? In most OLAP scenarios the business requirement is to have something like 25 months worth of data available in the system at all times. It's often beneficial for performance and disk expense reasons to take old data offline. The best way to do this in a SQL Server environment is to partition your fact tables using a year/month based partition function. For partitioning to be successful it is necessary for your partition column to be part of the clustered index. If you cluster on your date column and partition on your date column as well, you're all set.

In short, if you're working in a data warehouse/OLAP environment pay careful attention to what column(s) you create your clustered indexes on. It can make or break the performance of your solution!