No blog is 'Web 2.0' without tagging, so today I added tagging to the blog engine. I originally sketched out a 3-table normalized schema (Posts, Tags, Post_Tags_XREF) but then began to second-guess myself. I had heard that tags and relational databases are a difficult things to accomplish efficiently.
A bit of Google searching, and I saw that most people just cheated and used a "tags" column and full text search. For my little blog, I didn't want to use such a hacked approach, so I went with my original 3NF idea.
This is actually the entire schema for this site, it will just continue to grow organically as I add features.
My first approach was to import the new tag tables into the LINQ-To-SQL designer, but the designer does not seem to understand the concept of a reference table and generated some truly brain-dead code. Adding a tag caused it to hit the database seventeen times. Unacceptable.
So, I decided to move all of the tag code into stored procedures.
Getting tags for a post was very simple:
-- =============================================
-- Author: Jonathan Holland
-- Create date: 2/7/2009
-- Description: Gets Tags Associated to a Post
-- =============================================
ALTER PROCEDURE GetTags
@PostID BIGINT
AS
BEGIN
SET NOCOUNT ON;
SELECT
t.id,
t.[Name]
FROM
Tags t,
Post_Tag_XREF pt
WHERE
pt.PostID = @PostID AND
pt.TagID = t.id;
END
It was simply a single inner join, this will be very fast.
Getting posts for a tag was a little more complex. I decided that I did not need to have multiple tag searching for my site. If I did, this would have have involved parsing the input string into a temp table and then joining against it...Probably slow.
However, for my needs, I won't need that, as I can get away with one tag to many posts:
-- =============================================
-- Author: Jonathan Holland
-- Create date: 2/7/2009
-- Description: Gets Posts Associated to a Tag
-- =============================================
ALTER PROCEDURE GetTaggedPosts
@TagName NVARCHAR(50)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @TagId BIGINT;
SELECT
@TagId = t.ID
FROM
Tags t
WHERE
t.[Name] = @TagName;
SELECT
p.Id,
p.Subject,
p.[Text],
p.[TimeStamp],
p.[UrlKey]
FROM
Posts p,
Post_Tag_XREF pt
WHERE
pt.TagID = @TagId AND
p.Id = pt.PostID
ORDER BY p.[TimeStamp] DESC;
END
Finally, I needed a way to set the tags on each post. I wanted to be able to just pass in a VARCHAR of tags and have the procedure do all of the heavy lifting.
After a bit of work, I came up with this:
-- =============================================
-- Author: Jonathan Holland
-- Create date: 2/7/2009
-- Description: Tags a post with the passed in tags
-- =============================================
ALTER PROCEDURE [bloguser].[TagPost]
@PostID BIGINT,
@Tags NVARCHAR(500)
AS
BEGIN
SET NOCOUNT ON;
-- Delete any tags currently on this post
DELETE FROM Post_Tag_XREF WHERE PostID = @PostID;
-- Split the tags into a temporary table
DECLARE @TagTable TABLE
(
TagName NVARCHAR(50)
);
INSERT INTO @TagTable (TagName)
SELECT String AS TagName
FROM SplitString(@Tags);
-- Add any new tags to the tag table and associate all the tags in the XREF table
DECLARE @CurrentTag NVARCHAR(50);
DECLARE @CurrentId BIGINT;
DECLARE TagList CURSOR FOR
SELECT TagName FROM @TagTable;
OPEN TagList;
FETCH NEXT FROM TagList
INTO @CurrentTag;
WHILE @@FETCH_STATUS = 0
BEGIN
-- If there isn't a tag with this name yet, add one.
IF NOT EXISTS(SELECT t.id FROM Tags t WHERE t.[name] = @CurrentTag)
BEGIN
INSERT INTO Tags ([Name])
VALUES (@CurrentTag);
END
SELECT @CurrentID = t.ID
FROM Tags t
WHERE t.[name] = @CurrentTag;
-- Create the Post/Tag Assocation
INSERT INTO Post_Tag_XREF (TagID, PostID)
VALUES (@CurrentID, @PostID);
FETCH NEXT FROM TagList
INTO @CurrentTag;
END
-- Cleanup the Cursor
CLOSE TagList;
DEALLOCATE TagList;
-- Now delete any tags that are no longer associated to ANY posts
DELETE FROM Tags WHERE
Tags.ID NOT IN
(
SELECT t.ID
FROM
Tags t,
Post_Tag_XREF pt
WHERE
t.ID = pt.TagID
);
END
While intimidating at first, this is actually fairly straight forward, I'll run through the procedure bit by bit:
DELETE FROM Post_Tag_XREF WHERE PostID = @PostID;
I assume that the tags passed in are the new tags for this post, so the first thing to-do is to blow away any existing tag assocations.
-- Split the tags into a temporary table
DECLARE @TagTable TABLE
(
TagName NVARCHAR(50)
);
INSERT INTO @TagTable (TagName)
SELECT String AS TagName
FROM SplitString(@Tags);
Next I create a table variable with a single column (TagName) and split the contents of @Tags into it, using a custom function (More on that later).
The next chunk of code is boilerplate code for setting up a cursor. I personally believe that cursors should be avoided in most situations, but because I am using a cursor on a locally scoped table variable, it will not cause any locking or concurrency issues and is a fine solution.
With the tags loaded into a table variable, I can walk through them one by one and insert them into the appropriate tables:
WHILE @@FETCH_STATUS = 0
BEGIN
-- If there isn't a tag with this name yet, add one.
IF NOT EXISTS(SELECT t.id FROM Tags t WHERE t.[name] = @CurrentTag)
BEGIN
INSERT INTO Tags ([Name])
VALUES (@CurrentTag);
END
SELECT @CurrentID = t.ID
FROM Tags t
WHERE t.[name] = @CurrentTag;
-- Create the Post/Tag Assocation
INSERT INTO Post_Tag_XREF (TagID, PostID)
VALUES (@CurrentID, @PostID);
FETCH NEXT FROM TagList
INTO @CurrentTag;
END
I ensure that there is an entry in the Tag table for each tag, then create a link between that tag and the current post by inserting a row into the Post_Tag_XREF table.
Finally, I may have orphaned some tags when I blew away the existing assocations at the start of the procedure, so the last step is to clean up any orphans:
DELETE FROM Tags WHERE
Tags.ID NOT IN
(
SELECT t.ID
FROM
Tags t,
Post_Tag_XREF pt
WHERE
t.ID = pt.TagID
);
This is good housekeeping and prevents the database from becoming bloated with unused tags.
The entire procedure is not so bad when broken down into chunks, and runs lightening fast. I used LINQ-To-SQL to generate a proxy method, and applying tags is as simple as:
DB.TagPost(CurrentPost.Id, txtTags.Text.Trim());
You may have noticed that I called a function (SplitString) in the stored procedure, this is a custom function that I found online and modified for my needs...for those following along, here it is:
-- =============================================
-- Author: Jonathan Holland
-- Based On: Robert Verpalen's version
-- Create date: 2/7/2009
-- Description: Splits a String into a resultset
-- =============================================
ALTER FUNCTION SplitString
(
@String varchar(2048)
)
RETURNS TABLE AS RETURN
WITH INDICES AS
(
SELECT 0 S, 1 E
UNION ALL
SELECT E, CHARINDEX(' ', @String, E) + 1
FROM INDICES
WHERE E > S
)
SELECT
SUBSTRING(
@String,
S,
CASE WHEN
E > 1
THEN E - S - 1
ELSE
LEN(@String) - S + 1
END)
String
FROM
INDICES WHERE S > 0
Despite being initially somewhat complex, tagging is a largely solved problem on a small scale blog engine. I'd imagine that in larger situations tagging should probably be handled by a full text search engine such as Lucene.
Posted by Jonathan Holland on 2/8/2009.