« Microblog: Advice of the Day
On Handling Spam »


Tagging in Normalized Form

No blog is 'Web 2.0' without tagging, so today I added tagging to the blog engine. I originally sketched out a 3-table normalized schema (Posts, Tags, Post_Tags_XREF) but then began to second-guess myself. I had heard that tags and relational databases are a difficult things to accomplish efficiently.

A bit of Google searching, and I saw that most people just cheated and used a "tags" column and full text search. For my little blog, I didn't want to use such a hacked approach, so I went with my original 3NF idea.

DB Schema

This is actually the entire schema for this site, it will just continue to grow organically as I add features.

My first approach was to import the new tag tables into the LINQ-To-SQL designer, but the designer does not seem to understand the concept of a reference table and generated some truly brain-dead code. Adding a tag caused it to hit the database seventeen times. Unacceptable.

So, I decided to move all of the tag code into stored procedures.

Getting tags for a post was very simple:

-- =============================================

-- Author:        Jonathan Holland

-- Create date: 2/7/2009

-- Description:    Gets Tags Associated to a Post

-- =============================================

ALTER PROCEDURE GetTags

    @PostID        BIGINT

AS

BEGIN   

    SET NOCOUNT ON;

 

    SELECT

        t.id,

        t.[Name]

    FROM

        Tags t,

        Post_Tag_XREF pt

    WHERE

        pt.PostID = @PostID AND

        pt.TagID = t.id;

END

It was simply a single inner join, this will be very fast.

Getting posts for a tag was a little more complex. I decided that I did not need to have multiple tag searching for my site. If I did, this would have have involved parsing the input string into a temp table and then joining against it...Probably slow.

However, for my needs, I won't need that, as I can get away with one tag to many posts:

-- =============================================

-- Author:        Jonathan Holland

-- Create date: 2/7/2009

-- Description:    Gets Posts Associated to a Tag

-- =============================================

ALTER PROCEDURE GetTaggedPosts

    @TagName    NVARCHAR(50)

AS

BEGIN   

    SET NOCOUNT ON;

 

    DECLARE @TagId BIGINT;

 

    SELECT

        @TagId = t.ID

    FROM

        Tags t

    WHERE

        t.[Name] = @TagName;

 

    SELECT

        p.Id,

        p.Subject,

        p.[Text],

        p.[TimeStamp],

        p.[UrlKey]

    FROM

        Posts p,       

        Post_Tag_XREF pt

    WHERE

        pt.TagID = @TagId AND

        p.Id = pt.PostID

    ORDER BY p.[TimeStamp] DESC;

END

Finally, I needed a way to set the tags on each post. I wanted to be able to just pass in a VARCHAR of tags and have the procedure do all of the heavy lifting.

After a bit of work, I came up with this:

-- =============================================

-- Author:        Jonathan Holland

-- Create date: 2/7/2009

-- Description:    Tags a post with the passed in tags

-- =============================================

ALTER PROCEDURE [bloguser].[TagPost]

    @PostID    BIGINT,

    @Tags    NVARCHAR(500)

AS

BEGIN   

    SET NOCOUNT ON;

 

    -- Delete any tags currently on this post

    DELETE FROM Post_Tag_XREF WHERE PostID = @PostID;

 

    -- Split the tags into a temporary table       

    DECLARE @TagTable TABLE

    (

        TagName NVARCHAR(50)

    );

    INSERT INTO @TagTable (TagName)

        SELECT String AS TagName

            FROM SplitString(@Tags);

 

    -- Add any new tags to the tag table and associate all the tags in the XREF table

    DECLARE @CurrentTag    NVARCHAR(50);

    DECLARE @CurrentId BIGINT;

 

    DECLARE TagList CURSOR FOR

        SELECT TagName FROM @TagTable;

 

    OPEN TagList;

    FETCH NEXT FROM TagList

        INTO @CurrentTag;

 

    WHILE @@FETCH_STATUS = 0

    BEGIN

        -- If there isn't a tag with this name yet, add one.

        IF NOT EXISTS(SELECT t.id FROM Tags t WHERE t.[name] = @CurrentTag)

        BEGIN

            INSERT INTO Tags ([Name])

                VALUES (@CurrentTag);

        END

 

        SELECT @CurrentID = t.ID

            FROM Tags t

            WHERE t.[name] = @CurrentTag;

 

        -- Create the Post/Tag Assocation

        INSERT INTO Post_Tag_XREF (TagID, PostID)

            VALUES (@CurrentID, @PostID);       

 

        FETCH NEXT FROM TagList

            INTO @CurrentTag;

    END

 

    -- Cleanup the Cursor

    CLOSE TagList;

    DEALLOCATE TagList;       

 

    -- Now delete any tags that are no longer associated to ANY posts   

    DELETE FROM Tags WHERE

        Tags.ID NOT IN

        (

            SELECT t.ID

            FROM

                Tags t,

                Post_Tag_XREF pt

            WHERE

                t.ID = pt.TagID

        );

END

While intimidating at first, this is actually fairly straight forward, I'll run through the procedure bit by bit:

DELETE FROM Post_Tag_XREF WHERE PostID = @PostID;

I assume that the tags passed in are the new tags for this post, so the first thing to-do is to blow away any existing tag assocations.

-- Split the tags into a temporary table       

DECLARE @TagTable TABLE

(

    TagName NVARCHAR(50)

);

INSERT INTO @TagTable (TagName)

    SELECT String AS TagName

        FROM SplitString(@Tags);

Next I create a table variable with a single column (TagName) and split the contents of @Tags into it, using a custom function (More on that later).

The next chunk of code is boilerplate code for setting up a cursor. I personally believe that cursors should be avoided in most situations, but because I am using a cursor on a locally scoped table variable, it will not cause any locking or concurrency issues and is a fine solution.

With the tags loaded into a table variable, I can walk through them one by one and insert them into the appropriate tables:

    WHILE @@FETCH_STATUS = 0

    BEGIN

        -- If there isn't a tag with this name yet, add one.

        IF NOT EXISTS(SELECT t.id FROM Tags t WHERE t.[name] = @CurrentTag)

        BEGIN

            INSERT INTO Tags ([Name])

                VALUES (@CurrentTag);

        END

 

        SELECT @CurrentID = t.ID

            FROM Tags t

            WHERE t.[name] = @CurrentTag;

 

        -- Create the Post/Tag Assocation

        INSERT INTO Post_Tag_XREF (TagID, PostID)

            VALUES (@CurrentID, @PostID);       

 

        FETCH NEXT FROM TagList

            INTO @CurrentTag;

    END

I ensure that there is an entry in the Tag table for each tag, then create a link between that tag and the current post by inserting a row into the Post_Tag_XREF table.

Finally, I may have orphaned some tags when I blew away the existing assocations at the start of the procedure, so the last step is to clean up any orphans:

    DELETE FROM Tags WHERE

        Tags.ID NOT IN

        (

            SELECT t.ID

            FROM

                Tags t,

                Post_Tag_XREF pt

            WHERE

                t.ID = pt.TagID

        );

This is good housekeeping and prevents the database from becoming bloated with unused tags.

The entire procedure is not so bad when broken down into chunks, and runs lightening fast. I used LINQ-To-SQL to generate a proxy method, and applying tags is as simple as:

 DB.TagPost(CurrentPost.Id, txtTags.Text.Trim());

You may have noticed that I called a function (SplitString) in the stored procedure, this is a custom function that I found online and modified for my needs...for those following along, here it is:

-- =============================================

-- Author:        Jonathan Holland

-- Based On:    Robert Verpalen's version

-- Create date: 2/7/2009

-- Description:    Splits a String into a resultset

-- =============================================

ALTER FUNCTION SplitString

(

    @String varchar(2048)   

)

 

RETURNS TABLE AS RETURN

 

WITH INDICES AS

(

    SELECT 0 S, 1 E

        UNION ALL

    SELECT E, CHARINDEX(' ', @String, E) + 1

        FROM INDICES

    WHERE E > S

)

 

SELECT

    SUBSTRING(

        @String,

        S,

        CASE WHEN

            E > 1

                THEN E - S - 1

            ELSE

                LEN(@String) - S + 1

        END)

        String       

FROM

    INDICES WHERE S > 0

Despite being initially somewhat complex, tagging is a largely solved problem on a small scale blog engine. I'd imagine that in larger situations tagging should probably be handled by a full text search engine such as Lucene.

Posted by Jonathan Holland on 2/8/2009.

Tags: Tips-and-Tricks   SQL   LINQ   Database-Design   Tagging   Sprocs

Comments:

Nice.

Have you tried caching the Tags table? Then you wouldn't need to hit the Tags table again and again.

Gravatar Posted by cherouvim on 2/8/2009.

Nice, it seems like people don't want to grok SQL anymore, so it is refreshing to see someone who really gets it.

Gravatar Posted by Stewart on 2/8/2009.

Why does T-SQL not include a built in SplitString function?

Gravatar Posted by WAT on 2/9/2009.

Comments are closed on this post.