标签数据库的设计

Tags: Database schemas

Recently, on del.icio.us mailinglist, I asked the question “Does anyone know the database schema of del.icio.us?”. I got a few private responses so I wanted to share the knowledge with the world.

The Problem: You want to have a database schema where you can tag a bookmark (or a blog post or whatever) with as many tags as you want. Later then, you want to run queries to constrain the bookmarks to a union orintersection of tags. You also want to exclude (say: minus) some tags from the search result.

Apparently there are three different solutions (Attention: If you are building a websites that allows users to tag, be sure to have a look at my performance tests as performance seems to be a problem on larger scaled sites.)

“MySQLicious” solution


In this solution, the schema has got just one table, it is denormalized.

I named this solution “MySQLicious solution” because MySQLiciousimports del.icio.us data into a table with this structure.

Intersection (AND)

Query for “search+webservice+semweb”:
SELECT * 
FROM `delicious` 
WHERE tags LIKE "%search%" 
AND tags LIKE "%webservice%" 
AND tags LIKE "%semweb%"

Union (OR)

Query for “search|webservice|semweb”:

SELECT * 
FROM `delicious` 
WHERE tags LIKE "%search%" 
OR tags LIKE "%webservice%" 
OR tags LIKE "%semweb%"

Minus

Query for “search+webservice-semweb”

SELECT * 
FROM `delicious` 
WHERE tags LIKE "%search%" 
AND tags LIKE "%webservice%" 
AND tags NOT LIKE "%semweb%"

Conclusion

The advantages of this solution:

  • just one table
  • the queries are very straightforward
  • one can also achieve results via fulltextsearch. That might be a little faster.
  • queries are quite slow according to some commenters. Fulltext search would speed up a bit. I did some performance tests to prove that.
  • In my follow up post I dealt with MySQL fulltext concerning tagging.


Disadvantages:

  • You have a limit on the number of tags per bookmark. Normally you use a 256byte field in your DB (VARCHAR). Otherwise, if you took a text field or similar, the query times would slow down, I suppose
  • Patrice noticed that LIKE "%search" will also find tags with “websearch”. If you alter the query to LIKE " %search% " you end up having a messy solution: You have to add a space to the beginning of the tags value to make this work.

“Scuttle” solution

Scuttle organizes its data in two tables. That table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table. 

Intersection (AND)

Query for “bookmark+webservice+semweb”:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR)

Query for “bookmark|webservice|semweb”:

Just leave out the HAVING clause and you have union:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

I guess the main advantage of this solution is that it is more normalized than the first solution, and that you can have unlimited number of tags per bookmark.

“Toxi” solution

Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used bywordpress.

The queries are quite the same as in the “scuttle” solution.

Intersection (AND)

Query for “bookmark+webservice+semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR)

Query for “bookmark|webservice|semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id 
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

The advantages of this solution:

  • You can save extra information on each tag (description, tag hierarchy, …)
  • This is the most normalized solution (that is, if you go for 3NF: take this one :-)

Disadvantages:

  • When altering or deleting bookmarks you can end up with tag-orphans.

If you want to have more complicated queries like (bookmarks OR bookmark) AND (webservice or WS) AND NOT (semweb or semanticweb) the queries tend to become very complicated. In these cases I suggest the following query/computation process:

  1. Run a query for each tag appearing in your “tag-query”: SELECT b.id FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND b.id = bt.bookmark_id AND t.name = "semweb"
  2. Put each id-set from the result into an array (that is: in your favourite coding language). You could cache this arrays if you want..
  3. Constrain the arrays with union or intersection or whatever.

In this way, you can also do queries like (del.icio.us|delicious)+(semweb|semantic_web)-search. This type of queries (that is: the brackets) cannot be done by using the denormalized “MySQLicious solution”. 
This is the most flexible data structure and I guess it should scale pretty good (that is: if you do some caching).

Update May, 2006. This arcticle got quite some attention. I wasn’t really prepared for that! It seems people keep referring to it and even some new sites that allow tagging give credit to my articles. I think the real credit goes to the contributers of the different schemas: MySQLicious, scuttle, Toxi and to all the contributors of the comments (be sure to read them!)

P.S. Thanks to Toxi for sending me the queries for the three-table-schema, Benjamin Reitzammer for pointing me to a loughing meme article (a good reference for tag queries) and powerlinux for pointing me to scuttle.
Further reading

  • Taglist: a mailing list dedicated to schemas with tagging
  • Tagschema: A blog dedicated to tagging schemas
  • Tag-related Queries on Snippets
  • Freetag is a php “library” with which you can add tags to whatever object you like. It actually uses the “toxi schema”.
  • Hammy gives an insight how he did his tagging system with “less DB and more code” (that is: regular expressions), interesting!
  • Brad Choate has got some ideas which tag queries should be possible
  • Feedmaker has written a sort of reply to this article

你可能感兴趣的:(标签数据库的设计)