Recursive CTEs
This is something that a lot of people have blogged about, but since it cropped up again recently in a discussion I was having about part of a project I thought it’d be worthwhile writing something myself.
SQL Server doesn’t like looping.
We all know this. We all know that includes cursors and while loops, but what seems to be forgotton is that a recursive common table expression is also a loop. A requirement came up for a developer to find all of the directory numbers that were stored in a table including those that were “missing” from the table (think sequence). Let’s create some sample data to illustrate the problem.
DROP TABLE IF EXISTS [testEnvironment];
CREATE TABLE [testEnvironment]
(
[ID] INT IDENTITY(1, 1),
[DIRECTORY_NUMBER] VARCHAR(50) NULL
);
CREATE CLUSTERED INDEX [cl_ID_testEnvironment] ON [testEnvironment] ([ID] ASC);
CREATE NONCLUSTERED INDEX [nc_DIRECTORY_NUMBER_testEnvironment] ON [testEnvironment] ([DIRECTORY_NUMBER] ASC);
INSERT INTO [testEnvironment]
( [DIRECTORY_NUMBER]
)
SELECT DISTINCT
[a].[randomBigInt]
FROM ( SELECT TOP 1000000
( ABS(CHECKSUM(NEWID())) % 1000000 ) + 1 AS [randomBigInt]
FROM [master].[sys].[syscolumns] [sc1]
CROSS JOIN [master].[sys].[syscolumns] [sc2]
CROSS JOIN [master].[sys].[syscolumns] [sc3]
) [a];
Being a good developer, they decided to avoid a cursor or a while loop to solve this and instead chose a recursive common table expression. Initially, the design was like this: -
-- THESE ARE PARAMETERS PASSED IN TO A REPORT
DECLARE @initial AS BIGINT = 3500,
@final AS BIGINT = 4500;
DECLARE @TEMP TABLE
(
[ID] INT IDENTITY(1, 1),
[DIRECTORY_NUMBER] VARCHAR(50),
[IP_DESCRIPTION] VARCHAR(11)
);
INSERT INTO @TEMP
( [DIRECTORY_NUMBER],
[IP_DESCRIPTION]
)
SELECT [DIRECTORY_NUMBER],
'Assigned'
FROM [testEnvironment]
WHERE ( [DIRECTORY_NUMBER] >= @initial
AND [DIRECTORY_NUMBER] < @final
);
WITH [cte_n]
AS ( SELECT @initial AS [DIRECTORY_NUMBER]
UNION ALL
SELECT [cte_n].[DIRECTORY_NUMBER] + 1
FROM [cte_n]
WHERE [cte_n].[DIRECTORY_NUMBER] + 1 < @final
)
INSERT INTO @TEMP
( [DIRECTORY_NUMBER],
[IP_DESCRIPTION]
)
SELECT [cte].[DIRECTORY_NUMBER], 'Un-Assigned'
FROM [cte_n] [cte]
WHERE NOT EXISTS ( SELECT *
FROM @TEMP [X]
WHERE [X].[DIRECTORY_NUMBER] = [cte].[DIRECTORY_NUMBER] )
OPTION ( MAXRECURSION 0 );
SELECT *
FROM @TEMP
ORDER BY [DIRECTORY_NUMBER];
Running this over a thousand rows, as in the example above, worked fine. The report was created and the developer started testing. As a load test, he decided to see what would happen if someone were to request a hundred thousand rows out of the report. This caused the report to take longer than the timeout settings to return (in fact, it took longer than an hour!) and is where my advice was sought.
I suggested that since we are looking at sequential numbers, we could just create a numbers table or a cascading common table expression that generates a numbers work table in memory. Below is an example of the cascading common table expression: -
-- THESE ARE PARAMETERS PASSED IN TO A REPORT
DECLARE @initial AS BIGINT = 3500,
@final AS BIGINT = 135000;
WITH [cte_n10] ( [N] )
AS ( SELECT 1
FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) [a] ( [n] )
),
[cte_n100] ( [N] )
AS ( SELECT 1
FROM [cte_n10] [a]
CROSS JOIN [cte_n10] [b]
),
[cte_n10000] ( [N] )
AS ( SELECT 1
FROM [cte_n100] [a]
CROSS JOIN [cte_n100] [b]
),
[cte_n1000000] ( [N] )
AS ( SELECT 1
FROM [cte_n100] [a]
CROSS JOIN [cte_n10000] [b]
),
[cte_RN] ( [N] )
AS ( SELECT 0
UNION ALL
SELECT TOP ( @final - @initial )
ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL
) )
FROM [cte_n1000000]
),
[cte_n] ( [DIRECTORY_NUMBER] )
AS ( SELECT @initial + [cte_RN].[N]
FROM [cte_RN]
WHERE @initial + [cte_RN].[N] <= @final
)
SELECT [cte].[DIRECTORY_NUMBER],
CASE WHEN [te].[DIRECTORY_NUMBER] IS NOT NULL THEN 'Assigned'
ELSE 'Un-Assigned'
END AS [IP_DESCRIPTION]
FROM [cte_n] [cte]
LEFT OUTER JOIN [testEnvironment] [te] ON [cte].[DIRECTORY_NUMBER] = [te].[DIRECTORY_NUMBER]
ORDER BY [cte].[DIRECTORY_NUMBER];
This takes advantage of a few cartesian joins to generate a set of numbers, followed by a window function and a top to limit the overall result-set. When we’re looking at few numbers, the cascading common table expression will only expand the parts of the query it actually needs (so for 10 rows, it’ll look at cte_n10, cte_RN and cte_n).
Let’s create a quick performance test to see what happens over a few different rows.
SET NOCOUNT ON;
DECLARE @TABLE AS TABLE
(
[Initial] BIGINT,
[Final] BIGINT
);
INSERT INTO @TABLE
( [Initial], [Final] )
VALUES ( 3500, 4500 ),
( 500, 10498 ),
( 673, 114980 );
DECLARE @Duration CHAR(12),
@StartTime DATETIME,
@Size INT;
DECLARE @initial AS BIGINT,
@final AS BIGINT;
WHILE EXISTS ( SELECT 1
FROM @TABLE )
BEGIN
SELECT TOP 1
@initial = [Initial],
@final = [Final]
FROM @TABLE
ORDER BY [Final] - [Initial];
SET @Size = @final - @initial;
DROP TABLE IF EXISTS #resultsHolder1;
SELECT @StartTime = GETDATE();
WITH [cte_n10] ( [N] )
AS ( SELECT 1
FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) [a] ( [n] )
),
[cte_n100] ( [N] )
AS ( SELECT 1
FROM [cte_n10] [a]
CROSS JOIN [cte_n10] [b]
),
[cte_n10000] ( [N] )
AS ( SELECT 1
FROM [cte_n100] [a]
CROSS JOIN [cte_n100] [b]
),
[cte_n1000000] ( [N] )
AS ( SELECT 1
FROM [cte_n100] [a]
CROSS JOIN [cte_n10000] [b]
),
[cte_RN] ( [N] )
AS ( SELECT 0
UNION ALL
SELECT TOP ( @final - @initial )
ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL
) )
FROM [cte_n1000000]
),
[cte_n] ( [DIRECTORY_NUMBER] )
AS ( SELECT @initial + [cte_RN].[N]
FROM [cte_RN]
WHERE @initial + [cte_RN].[N] <= @final
)
SELECT [cte].[DIRECTORY_NUMBER],
CASE WHEN [te].[DIRECTORY_NUMBER] IS NOT NULL THEN 'Assigned'
ELSE 'Un-Assigned'
END AS [IP_DESCRIPTION]
INTO [#resultsHolder1]
FROM [cte_n] [cte]
LEFT OUTER JOIN [testEnvironment] [te] ON [cte].[DIRECTORY_NUMBER] = [te].[DIRECTORY_NUMBER]
ORDER BY [cte].[DIRECTORY_NUMBER];
SELECT @Duration = CONVERT(CHAR(12), GETDATE() - @StartTime, 114);
RAISERROR('New Query Duration: %s over %d rows',0,1,@Duration,@Size) WITH NOWAIT;
DROP TABLE IF EXISTS #resultsHolder2
SELECT @StartTime = GETDATE();
DECLARE @TEMP TABLE
(
[ID] INT IDENTITY(1, 1),
[DIRECTORY_NUMBER] VARCHAR(50),
[IP_DESCRIPTION] VARCHAR(11)
);
INSERT INTO @TEMP
( [DIRECTORY_NUMBER],
[IP_DESCRIPTION]
)
SELECT [DIRECTORY_NUMBER],
'Assigned'
FROM [testEnvironment]
WHERE ( [DIRECTORY_NUMBER] >= @initial
AND [DIRECTORY_NUMBER] < @final
);
WITH [cte_n]
AS ( SELECT @initial AS [DIRECTORY_NUMBER]
UNION ALL
SELECT [cte_n].[DIRECTORY_NUMBER] + 1
FROM [cte_n]
WHERE [cte_n].[DIRECTORY_NUMBER] + 1 < @final
)
INSERT INTO @TEMP
( [DIRECTORY_NUMBER],
[IP_DESCRIPTION]
)
SELECT [cte].[DIRECTORY_NUMBER],
'Un-Assigned'
FROM [cte_n] [cte]
WHERE NOT EXISTS ( SELECT *
FROM @TEMP [X]
WHERE [X].[DIRECTORY_NUMBER] = [cte].[DIRECTORY_NUMBER] )
OPTION ( MAXRECURSION 0 );
SELECT [DIRECTORY_NUMBER],
[IP_DESCRIPTION]
INTO [#resultsHolder2]
FROM @TEMP
ORDER BY [DIRECTORY_NUMBER];
SELECT @Duration = CONVERT(CHAR(12), GETDATE() - @StartTime, 114);
RAISERROR('Initial Query Duration: %s over %d rows',0,1,@Duration,@Size) WITH NOWAIT;
DELETE FROM @TABLE
WHERE [Initial] = @initial
AND [Final] = @final;
END;
I inserted the result-set in to temporary tables to take the display time out of the equation. If you look at the messages tab, you will have something like looks like this: -
New Query Duration: 00:00:00:087 over 1000 rows
Initial Query Duration: 00:00:00:167 over 1000 rows
New Query Duration: 00:00:00:087 over 9998 rows
Initial Query Duration: 00:00:07:440 over 9998 rows
New Query Duration: 00:00:00:443 over 114307 rows
Note, I killed the query after 5 minutes, at which point the initial query had still not managed to return for 114,307 rows. Even without that part of the execution to compare with, we’re left with a fairly large improvement: -
Rows | Initial Query Duration | New Query Duration |
---|---|---|
1000 | 0.167 | 0.087 |
9998 | 7.440 | 0.087 |
114307 | 05:00.000 (killed query) | 0.443 |
That’s an improvement with no indexes and no major design change.
Share on: