Recursive CTEs // SQL and other ramblings

This is something that a lot of people have blogged about, but since it cropped up again recently in a discussion I was having about part of a project I thought it’d be worthwhile writing something myself.

SQL Server doesn’t like looping.

We all know this. We all know that includes cursors and while loops, but what seems to be forgotton is that a recursive common table expression is also a loop. A requirement came up for a developer to find all of the directory numbers that were stored in a table including those that were “missing” from the table (think sequence). Let’s create some sample data to illustrate the problem.

DROP TABLE IF EXISTS [testEnvironment];
  
CREATE TABLE [testEnvironment]
    (
      [ID] INT IDENTITY(1, 1),
      [DIRECTORY_NUMBER] VARCHAR(50) NULL
    );

CREATE CLUSTERED INDEX [cl_ID_testEnvironment] ON [testEnvironment] ([ID] ASC);
CREATE NONCLUSTERED INDEX [nc_DIRECTORY_NUMBER_testEnvironment] ON [testEnvironment] ([DIRECTORY_NUMBER] ASC);

INSERT  INTO [testEnvironment]
        ( [DIRECTORY_NUMBER]
        )
SELECT DISTINCT
        [a].[randomBigInt]
FROM    ( SELECT TOP 1000000
                    ( ABS(CHECKSUM(NEWID())) % 1000000 ) + 1 AS [randomBigInt]
          FROM      [master].[sys].[syscolumns] [sc1]
          CROSS JOIN [master].[sys].[syscolumns] [sc2]
          CROSS JOIN [master].[sys].[syscolumns] [sc3]
        ) [a];

Being a good developer, they decided to avoid a cursor or a while loop to solve this and instead chose a recursive common table expression. Initially, the design was like this: -

-- THESE ARE PARAMETERS PASSED IN TO A REPORT
DECLARE @initial AS BIGINT = 3500,
    @final AS BIGINT = 4500;

DECLARE @TEMP TABLE
    (
      [ID] INT IDENTITY(1, 1),
      [DIRECTORY_NUMBER] VARCHAR(50),
      [IP_DESCRIPTION] VARCHAR(11)
    );

INSERT  INTO @TEMP
        ( [DIRECTORY_NUMBER],
          [IP_DESCRIPTION]
        )
SELECT  [DIRECTORY_NUMBER],
        'Assigned'
FROM    [testEnvironment]
WHERE   ( [DIRECTORY_NUMBER] >= @initial
          AND [DIRECTORY_NUMBER] < @final
        );

WITH    [cte_n]
          AS ( SELECT   @initial AS [DIRECTORY_NUMBER]
               UNION ALL
               SELECT   [cte_n].[DIRECTORY_NUMBER] + 1
               FROM     [cte_n]
               WHERE    [cte_n].[DIRECTORY_NUMBER] + 1 < @final
             )
    INSERT  INTO @TEMP
            ( [DIRECTORY_NUMBER],
              [IP_DESCRIPTION]
            )
    SELECT  [cte].[DIRECTORY_NUMBER], 'Un-Assigned'
    FROM    [cte_n] [cte]
    WHERE   NOT EXISTS ( SELECT *
                         FROM   @TEMP [X]
                         WHERE  [X].[DIRECTORY_NUMBER] = [cte].[DIRECTORY_NUMBER] )
    OPTION  ( MAXRECURSION 0 );

SELECT  *
FROM    @TEMP
ORDER BY [DIRECTORY_NUMBER];

Running this over a thousand rows, as in the example above, worked fine. The report was created and the developer started testing. As a load test, he decided to see what would happen if someone were to request a hundred thousand rows out of the report. This caused the report to take longer than the timeout settings to return (in fact, it took longer than an hour!) and is where my advice was sought.

I suggested that since we are looking at sequential numbers, we could just create a numbers table or a cascading common table expression that generates a numbers work table in memory. Below is an example of the cascading common table expression: -

-- THESE ARE PARAMETERS PASSED IN TO A REPORT
DECLARE @initial AS BIGINT = 3500,
    @final AS BIGINT = 135000;

WITH    [cte_n10] ( [N] )
          AS ( SELECT   1
               FROM     ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) [a] ( [n] )
             ),
        [cte_n100] ( [N] )
          AS ( SELECT   1
               FROM     [cte_n10] [a]
               CROSS JOIN [cte_n10] [b]
             ),
        [cte_n10000] ( [N] )
          AS ( SELECT   1
               FROM     [cte_n100] [a]
               CROSS JOIN [cte_n100] [b]
             ),
        [cte_n1000000] ( [N] )
          AS ( SELECT   1
               FROM     [cte_n100] [a]
               CROSS JOIN [cte_n10000] [b]
             ),
        [cte_RN] ( [N] )
          AS ( SELECT   0
               UNION ALL
               SELECT TOP ( @final - @initial )
                        ROW_NUMBER() OVER ( ORDER BY ( SELECT   NULL
                                                     ) )
               FROM     [cte_n1000000]
             ),
        [cte_n] ( [DIRECTORY_NUMBER] )
          AS ( SELECT   @initial + [cte_RN].[N]
               FROM     [cte_RN]
               WHERE    @initial + [cte_RN].[N] <= @final
             )
    SELECT  [cte].[DIRECTORY_NUMBER],
            CASE WHEN [te].[DIRECTORY_NUMBER] IS NOT NULL THEN 'Assigned'
                 ELSE 'Un-Assigned'
            END AS [IP_DESCRIPTION]
    FROM    [cte_n] [cte]
    LEFT OUTER JOIN [testEnvironment] [te] ON [cte].[DIRECTORY_NUMBER] = [te].[DIRECTORY_NUMBER]
    ORDER BY [cte].[DIRECTORY_NUMBER];

This takes advantage of a few cartesian joins to generate a set of numbers, followed by a window function and a top to limit the overall result-set. When we’re looking at few numbers, the cascading common table expression will only expand the parts of the query it actually needs (so for 10 rows, it’ll look at cte_n10, cte_RN and cte_n).

Let’s create a quick performance test to see what happens over a few different rows.

SET NOCOUNT ON;

DECLARE @TABLE AS TABLE
    (
      [Initial] BIGINT,
      [Final] BIGINT
    );
INSERT  INTO @TABLE
        ( [Initial], [Final] )
VALUES  ( 3500, 4500 ),
        ( 500, 10498 ),
        ( 673, 114980 );

DECLARE @Duration CHAR(12),
    @StartTime DATETIME,
    @Size INT;

DECLARE @initial AS BIGINT,
    @final AS BIGINT;

WHILE EXISTS ( SELECT   1
               FROM     @TABLE )
BEGIN
    SELECT TOP 1
            @initial = [Initial],
            @final = [Final]
    FROM    @TABLE
    ORDER BY [Final] - [Initial];

    SET @Size = @final - @initial;

    DROP TABLE IF EXISTS #resultsHolder1;

    SELECT  @StartTime = GETDATE();

    WITH    [cte_n10] ( [N] )
              AS ( SELECT   1
                   FROM     ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) [a] ( [n] )
                 ),
            [cte_n100] ( [N] )
              AS ( SELECT   1
                   FROM     [cte_n10] [a]
                   CROSS JOIN [cte_n10] [b]
                 ),
            [cte_n10000] ( [N] )
              AS ( SELECT   1
                   FROM     [cte_n100] [a]
                   CROSS JOIN [cte_n100] [b]
                 ),
            [cte_n1000000] ( [N] )
              AS ( SELECT   1
                   FROM     [cte_n100] [a]
                   CROSS JOIN [cte_n10000] [b]
                 ),
            [cte_RN] ( [N] )
              AS ( SELECT   0
                   UNION ALL
                   SELECT TOP ( @final - @initial )
                            ROW_NUMBER() OVER ( ORDER BY ( SELECT   NULL
                                                         ) )
                   FROM     [cte_n1000000]
                 ),
            [cte_n] ( [DIRECTORY_NUMBER] )
              AS ( SELECT   @initial + [cte_RN].[N]
                   FROM     [cte_RN]
                   WHERE    @initial + [cte_RN].[N] <= @final
                 )
        SELECT  [cte].[DIRECTORY_NUMBER],
                CASE WHEN [te].[DIRECTORY_NUMBER] IS NOT NULL THEN 'Assigned'
                     ELSE 'Un-Assigned'
                END AS [IP_DESCRIPTION]
        INTO    [#resultsHolder1]
        FROM    [cte_n] [cte]
        LEFT OUTER JOIN [testEnvironment] [te] ON [cte].[DIRECTORY_NUMBER] = [te].[DIRECTORY_NUMBER]
        ORDER BY [cte].[DIRECTORY_NUMBER];

    SELECT  @Duration = CONVERT(CHAR(12), GETDATE() - @StartTime, 114);

    RAISERROR('New Query Duration: %s over %d rows',0,1,@Duration,@Size) WITH NOWAIT;

    DROP TABLE IF EXISTS #resultsHolder2

    SELECT  @StartTime = GETDATE();

    DECLARE @TEMP TABLE
        (
          [ID] INT IDENTITY(1, 1),
          [DIRECTORY_NUMBER] VARCHAR(50),
          [IP_DESCRIPTION] VARCHAR(11)
        );

    INSERT  INTO @TEMP
            ( [DIRECTORY_NUMBER],
              [IP_DESCRIPTION]
            )
    SELECT  [DIRECTORY_NUMBER],
            'Assigned'
    FROM    [testEnvironment]
    WHERE   ( [DIRECTORY_NUMBER] >= @initial
              AND [DIRECTORY_NUMBER] < @final
            );

    WITH    [cte_n]
              AS ( SELECT   @initial AS [DIRECTORY_NUMBER]
                   UNION ALL
                   SELECT   [cte_n].[DIRECTORY_NUMBER] + 1
                   FROM     [cte_n]
                   WHERE    [cte_n].[DIRECTORY_NUMBER] + 1 < @final
                 )
        INSERT  INTO @TEMP
                ( [DIRECTORY_NUMBER],
                  [IP_DESCRIPTION]
                )
        SELECT  [cte].[DIRECTORY_NUMBER],
                'Un-Assigned'
        FROM    [cte_n] [cte]
        WHERE   NOT EXISTS ( SELECT *
                             FROM   @TEMP [X]
                             WHERE  [X].[DIRECTORY_NUMBER] = [cte].[DIRECTORY_NUMBER] )
        OPTION  ( MAXRECURSION 0 );

    SELECT  [DIRECTORY_NUMBER],
            [IP_DESCRIPTION]
    INTO    [#resultsHolder2]
    FROM    @TEMP
    ORDER BY [DIRECTORY_NUMBER];

    SELECT  @Duration = CONVERT(CHAR(12), GETDATE() - @StartTime, 114);

    RAISERROR('Initial Query Duration: %s over %d rows',0,1,@Duration,@Size) WITH NOWAIT;

    DELETE  FROM @TABLE
    WHERE   [Initial] = @initial
            AND [Final] = @final;
END;

I inserted the result-set in to temporary tables to take the display time out of the equation. If you look at the messages tab, you will have something like looks like this: -

New Query Duration: 00:00:00:087 over 1000 rows
Initial Query Duration: 00:00:00:167 over 1000 rows
New Query Duration: 00:00:00:087 over 9998 rows
Initial Query Duration: 00:00:07:440 over 9998 rows
New Query Duration: 00:00:00:443 over 114307 rows

Note, I killed the query after 5 minutes, at which point the initial query had still not managed to return for 114,307 rows. Even without that part of the execution to compare with, we’re left with a fairly large improvement: -

Rows	Initial Query Duration	New Query Duration
1000	0.167	0.087
9998	7.440	0.087
114307	05:00.000 (killed query)	0.443

That’s an improvement with no indexes and no major design change.

Share on: