I recently posted an on how to use a CTE to traverse a tree in a relational database on Reddit.
This started a lot of conversation, and people were wondering on the performance aspect of using recursive queries.
Well, I decided to take an extreme look. I'll pretend that Reddit had a very popular thread that had 5,000 comments on it. And, to really tax this query, I'll make it so that it is an extremely unbalanced tree 1000 comments deep.
For those following along, I'm going to use T-SQL, and everything in this post can be dropped directly into SQLServer 2005+ and ran.
I'm going to build a tree that looks like this:
Root
\
Comment 1
Comment 2
Comment 3
Comment 4
Comment 5
\
Comment 6
Comment 7
Comment 8
Comment 9
Comment 10
\
Comment 11
Comment 12
Comment 13
Comment 14
Comment 15 (This repeats to depth of 1,000)
This is the absolute worst case scenario, because I have a horrible unbalanced tree, and will have to recursively walk the entire thing.
So, to create the schema:
USE [Tree]
CREATE TABLE [dbo].[Tree]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[ParentID] [int] NULL,
CONSTRAINT [PK_Tree] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Tree] WITH CHECK ADD CONSTRAINT [FK_Tree_Tree] FOREIGN KEY([ParentID])
REFERENCES [dbo].[Tree] ([Id])
GO
ALTER TABLE [dbo].[Tree] CHECK CONSTRAINT [FK_Tree_Tree]
This is a simple table with the following attributes:
- Two Columns (Id, ParentId)
- Id is an identity column and the primary key
- ParentId is Nullable, and is a Foreign Key back to Id
Now, I'm going to insert my comment thread of doom into this table, I'm using some pretty tacky T-SQL to do this, but I'm lazy:
USE Tree;
DECLARE @depthCounter int;
DECLARE @lastNode int;
DECLARE @i int;
SET @DepthCounter = 0;
INSERT INTO Tree (ParentID)
VALUES
(
NULL
);
SET @LastNode = SCOPE_IDENTITY();
WHILE @DepthCounter < 1000
BEGIN
SET @i = 0;
WHILE @i < 5
BEGIN
INSERT INTO Tree (ParentID)
VALUES
(
@LastNode
);
SET @i = @i + 1;
END
SET @LastNode = SCOPE_IDENTITY();
SET @DepthCounter = @DepthCounter + 1;
END
With my tree of doom inserted, I'll run the query starting from the root:
Use Tree;
WITH CommentTree (ParentID, ID, Level)
AS
(
SELECT ParentID, ID, 0 as [Level]
FROM Tree
WHERE ParentID IS NULL
UNION ALL
SELECT c.ParentID, c.ID, ct.Level + 1
FROM Tree c
JOIN CommentTree ct ON ct.ID = c.ParentID
)
SELECT ParentID, ID, Level FROM CommentTree
ORDER BY ParentID, Level, ID OPTION (MAXRECURSION 5000)
On my home laptop (A modest 2ghz) running Microsoft SQLServer 2005 EXPRESS I get the following result to generate this dataset:
SQL Server Execution Times:
CPU time = 10998 ms, elapsed time = 11209 ms.
11 Seconds is an eternity, but this is a the worlds worst case. Now to try the same test data, but instead walking only 100 levels deep (500 comments):
Use Tree;
WITH CommentTree (ParentID, ID, Level)
AS
(
SELECT ParentID, ID, 0 as [Level]
FROM Tree
WHERE ParentID = 44339
UNION ALL
SELECT c.ParentID, c.ID, ct.Level + 1
FROM Tree c
JOIN CommentTree ct ON ct.ID = c.ParentID
)
SELECT ParentID, ID, Level FROM CommentTree
ORDER BY ParentID, Level, ID OPTION (MAXRECURSION 5000)
This returned 500 rows and the following speed stats:
SQL Server Execution Times:
CPU time = 1217 ms, elapsed time = 1223 ms.
This is still a unusual dataset, because it is still a very off balance tree.
So, my final benchmark will be an approximation of a regular reddit thread. 300 comments total, 100 root comments, with 2 replies each.
The setup script:
USE Tree;
DECLARE @depthCounter int;
DECLARE @lastNode int;
DECLARE @firstNode int;
SET @DepthCounter = 0;
INSERT INTO Tree (ParentID)
VALUES
(
NULL
);
SET @FirstNode = SCOPE_IDENTITY();
WHILE @DepthCounter < 100
BEGIN
INSERT INTO Tree (ParentID)
VALUES
(
@FirstNode
);
SET @LastNode = SCOPE_IDENTITY();
INSERT INTO Tree (ParentID)
VALUES
(
@LastNode
);
INSERT INTO Tree (ParentID)
VALUES
(
@LastNode
);
SET @DepthCounter = @DepthCounter + 1;
END
Now, when ran:
(301 row(s) affected)
SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 51 ms.
That's right, in the real world, this query took 51ms.
I think this shows that CTE (even running on commodity hardware) are perfectly adequate to handle nested trees on social forums.
Posted by Jonathan Holland on 6/6/2009.