SQL Server – Indexing Strategies: Clustered,
NonClustered, and Filtered Indexes
Indexing decisions can
have huge impacts on database performance
This article reviews
optimal placement of clustered and nonclustered indexes on OLTP databases, and
explains how filtered indexes can be used to improve performance.
Clustered Indexes
By default, SQL Server will create the table’s clustered
index during the creation of the primary key:
|
CREATE TABLE PrimaryKeyTest (MyPK INT
PRIMARY KEY)
GO
|
|
SELECT * FROM Sys.Indexes WHERE
Object_ID = Object_ID ('PrimaryKeyTest')
|
This can be overridden by specifying the
NONCLUSTERED keyword during creation:
|
CREATE TABLE PrimaryKeyNonClusteredTest
(MyPK INT
PRIMARY KEY NONCLUSTERED)
GO
|
|
SELECT * FROM Sys.Indexes
WHERE Object_ID = Object_ID ('PrimaryKeyNonClusteredTest')
|
The HEAP exists because the table does not have
a clustered index defined.
It’s a common misconception that the primary key
and clustered index are bound together; the primary key is a special type of
unique constraint (it does not allow NULL values, while normal unique
constraints do). The Clustered index is a strictly-ordered B-tree intended to
maximize scan performance. Only one clustered index per table is permitted.
Although SQL Server defaults the clustered index
to the primary key, it is not always the optimal selection. The best choice for
the clustered index is selective (many distinct values), narrow (few columns),
changes infrequently (if at all), and is often used in queries that scan
a range sequentially. This is something that’s rarely done with a primary
key; it makes little sense to retrieve a range of values based on a surrogate
or business key. The narrowness is important because all non-clustered indexes
use the clustered index as the lookup key when accessing data, so bloating the
clustered index will do the same to nonclustered indexes. Changes to the
clustered index keys cause table re-ordering and fragmentation.
An ideal example of a clustered index use would
be on a Sales table where the sale date is recorded in a datetime field. This
would not change, would be highly selective, and often used in range queries
involving ORDER/GROUP BY.
Markus Winand has a highly detailed explanation of how SQL Server implements
clustered indexes.
NonClustered Indexes
Unlike clustered indexes, nonclustered indexes
contain pointers to the actual data, rather than physically storing the data.
Because of this, multiple indexes can be created on the same table (up to 1,000
total). Although there is a performance hit during DML operations to update
nonclustered indexes, the benefits greatly outweigh the downsides. They should
be used in the following scenarios:
·
On any column often
accessed by a WHERE clause
·
On any column often used
in JOIN or GROUP BY clauses
Additionally, Included columns
can be added to the nonclustered index for nonkey columns. These are columns
referenced in the SELECT portion of the query, rather than the WHERE clause. By
adding these as Included columns, the entire query can
be satisfied by the index, rather than necessitating additional I/O.
Filtered Indexes
Filtered indexes are a subtype of nonclustered
indexes introduced in SQL Server 2008. These indexes have a WHERE clause in the
index definition intended to reduce the total number of rows included in the
index. Fewer rows means less I/O while accessing the index.
Here’s a simplified example to illustrate
potential performance gains.
The table SalesFigures is created just to hold some monetary
data:
|
CREATE TABLE SalesFigures (SalesID INT
IDENTITY, SalesAmount Money)
|
Now it gets populated with a reasonable amount
of data (About 1.4 million rows):
|
DECLARE @i INT = 1, @Finish INT
= 100000
WHILE (@i < @Finish)
BEGIN
Insert SalesFigures (SalesAmount)
VALUES (100), (200), (300), (350), (375), (400), (425), (478),
(512), (550), (700), (750), (900), (1000)
Set @i+=1
END
|
Let’s also create a standard nonclustered index
for comparison purposes:
|
CREATE NONCLUSTERED INDEX
IX_SalesFigures_SalesAmount
ON SalesFigures
(SalesAmount)
GO
|
In our simplified example, we’ll assume that the
Sales value between 400 and 800 are by far the most interest during reporting:
|
SELECT SalesAmount FROM
SalesFigures
WHERE SalesAmount BETWEEN
400 AND 800
|
Our I/O costs are ~1.4.
Next we’ll drop our original index, and create a
filtered one using the values 400 and 800 as ranges for the where clause.
|
DROP INDEX SalesFigures.IX_SalesFigures_SalesAmount
GO
CREATE NONCLUSTERED INDEX
IX_SalesFigures_SalesAmount
ON SalesFigures
(SalesAmount)
WHERE SalesAmount >= 400 and
SalesAmount <= 800
GO
|
Under the hood, the filter predicate is stored
directly in Sys.Indexes:
|
SELECT filter_definition, * FROM
Sys.Indexes
WHERE Name =
'IX_SalesFigures_SalesAmount'
|
Re-running our same query produces the
following:
I/O costs have dropped to ~.7 – about 50% of the
standard nonclustered index. Even in this highly simplified example, the
performance improvements are apparent.
There are some limitations to the search
predicates that are available – CASE Statements, NOT
IN, the BETWEEN keyword, and date functions cannot be used
in the definition of a filtered index.
Sparse Columns and Filtered Indexes
Sparse is a column-level option that can be
applied during table creation or alteration. Specifying the Sparse keyword
during table creation minimizes the space requirements for NULL values for that
column. This option can be used on most SQL Server datatypes (excluded are
text, ntext, image, timestamp, geometry, geography, timestamp, and user-defined
types). The trade-off is that non-NULL values will require slightly more
overhead when read.
Sparse columns and filtered indexes are often
used together to maximize storage and performance where significant NULL values
are expected for a particular column. The following syntax creates a table with
a Sparse column, then creates a filtered index on that column specifying only
non-NULL values:
|
CREATE TABLE MoreSalesFigures
(SalesID INT IDENTITY,
SalesAmount Money,
ThisColumnIsNULLOften
Varchar(100) SPARSE)
GO
|
|
CREATE NONCLUSTERED INDEX
IX_MoreSalesFigures_ThisColumnIsNULLOften
ON MoreSalesFigures
(ThisColumnIsNULLOften)
WHERE ThisColumnIsNULLOften IS
NOT NULL
GO
|
This minimizes storage requirements for NULL
values while maximizing performance when reading non-NULL values. In addition
to optimizing NULL values, Sparse columns optimize zeroes in
the same way. There is also no realistic limit to how many Sparse columns
can be defined on a table (the actual limit is 100,000)
c
No comments:
Post a Comment