November 24, 2003

Jon Udell Investigates Bayesian Categorizers

Jon does his usual thorough job of describing his experiments with categorizing his blog posts, including pointers to the tools he used.

He finds that the categorizer doesn't always pick the right category for a post, but there's often a quite plausible reason for the incorrect choice. I think this "fuzzy" categorization is a benefit, but I'm looking to solve a slightly different problem.

All the posts I make to this blog are categorized. I manually choose a category when I write the entry, and it's not too onerous a task (i.e. it's something that I don't mind doing, so it's easily captured metadata). Sometimes things don't fit simply into one category, and although I can assign things to more than one category, that's harder and I've never done it.

Another problem is the number of posts in any one category. Already the "Computers" category has sixty-six entries, which is too many to browse through when looking for a particular item. Subdividing a category when it becomes too big would be the best solution, but that presents the problem of re-categorizing all the existing posts.

What I'd like is for Movable Type to notice when a category crosses some entry-count threshold (say, more than twenty posts), and suggest that I subdivide that category. Then I just use the new, more specific categories for new posts and leave the old posts as they are. The categories would build up over time into a hierarchy, and the top-level ones would be listed where my category list is now. Selecting one of them would give you a Yahoo!-style hierarchy to navigate for the postings.
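As a rough sketch of that threshold check (this is not anything Movable Type provides; the function name, the flat list of category labels and the sample counts other than the sixty-six "Computers" entries are all made up for illustration), something like this would do:

```python
from collections import Counter

def categories_to_subdivide(post_categories, threshold=20):
    """Return (category, count) pairs for categories with more than
    `threshold` posts, largest first."""
    counts = Counter(post_categories)
    return [(name, n) for name, n in counts.most_common() if n > threshold]

# Hypothetical data: one category label per post.
posts = ["Computers"] * 66 + ["Books"] * 5 + ["Cycling"] * 20

suggestions = categories_to_subdivide(posts)
for name, n in suggestions:
    print(f"{name} has {n} posts; consider subdividing it")
```

With the default threshold only "Computers" is flagged, because a category with exactly twenty posts hasn't crossed the line yet.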

This is where, I think, Bayesian filtering comes in. The "category browser" uses filtering to decide which posts to show for any particular category, rather than relying directly on the category that I chose. My choice of category is used to train the filter on what I think should go into a category, but it's the filter that decides which posts are valid for the category when someone is browsing through them. If a post scores well for more than one category, then that's fine, it can live in both categories. The only exception to the multi-categorization is that posts should only live in the most specific category in the hierarchy (e.g. something that scores well for both "Computers" and "Computers/MobileUI" will only appear in "Computers/MobileUI"). This means that all the old posts get recategorized automatically, and cross-category posts just appear more than once.
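To make the idea a bit more concrete, here is a minimal sketch of how it might hang together. It assumes a plain multinomial naive Bayes filter with add-one smoothing, categories named as slash-separated paths, and an arbitrary cut-off of 0.2 for "scores well"; the class, the function names and the toy training text are all hypothetical, and a real filter would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

class CategoryFilter:
    """Multinomial naive Bayes with add-one smoothing (an assumed design)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.post_counts = Counter()             # category -> training posts
        self.vocabulary = set()

    def train(self, text, category):
        # Called with the category the author picked for the post.
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.post_counts[category] += 1
        self.vocabulary.update(words)

    def posteriors(self, text):
        """Return P(category | text) for every trained category."""
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        total_posts = sum(self.post_counts.values())
        log_scores = {}
        for category, n_posts in self.post_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_posts / total_posts)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            log_scores[category] = score
        # Normalise in log space to avoid underflow on long posts.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

def most_specific(categories):
    """Drop any category that is an ancestor of another one in the set."""
    return {c for c in categories
            if not any(other.startswith(c + "/") for other in categories)}

def categories_for(model, text, threshold=0.2):
    """Categories a post should appear under when browsing."""
    scores = model.posteriors(text)
    return most_specific({c for c, p in scores.items() if p >= threshold})

# Toy training data standing in for manually categorized posts.
model = CategoryFilter()
model.train("laptop keyboard software compiler", "Computers")
model.train("phone touchscreen interface keypad", "Computers/MobileUI")
model.train("novel author chapter library", "Books")

nested = categories_for(model, "phone software keypad laptop")
crossed = categories_for(model, "novel software chapter laptop")
print(sorted(nested))
print(sorted(crossed))
```

The first post scores equally well for "Computers" and "Computers/MobileUI", so it only shows up under the more specific one; the second scores well for both "Books" and "Computers" and is listed under both. One design caveat: the posteriors here sum to one across all categories, so the more categories there are, the lower the cut-off has to be. Scoring each category independently (category versus everything else) might suit multi-category posts better.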

Yet more things to play with in my non-existent free time...

Posted by Adrian at November 24, 2003 01:21 PM

This blog post is on the personal blog of Adrian McEwen.