Darren Ferguson - Umbraco, Dev, Pho

Friday, September 22, 2006

More on NClassifier and content intelligence

Filed under: Umbraco, Work - by Darren Ferguson @ 10:03 AM

Note: When reading this post, click through to the detail view so that you can see the output that is being discussed. 

Ever seen pages like your Flickr tag list that use the pretty cool effect of displaying each tag in a size that relates to its frequency?

As well as classifying your posts, NClassifier makes it easy to get information on how often words occur in a piece of text. If you click through to the detail page of any blog post, you'll see that yesterday's classification now lists how often each word occurs in the post. It would be a relatively easy job to randomize the order in which the words appear and apply a font size based on how many times they occur in a document, or even across an entire website.
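The tag-cloud idea above can be sketched roughly as follows. This is an illustration in Python rather than .NET and does not use NClassifier; the function names, the pixel range and the linear scaling are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def word_counts(text):
    """Count how often each lowercase word occurs in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def font_sizes(counts, min_px=10, max_px=32):
    """Map each word's count linearly onto a size between min_px and max_px."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: round(min_px + (count - low) * (max_px - min_px) / span)
        for word, count in counts.items()
    }


def tag_cloud_html(counts, seed=None):
    """Render the words in random order, each sized by its frequency."""
    sizes = font_sizes(counts)
    words = list(sizes)
    random.Random(seed).shuffle(words)
    return " ".join(
        f'<span style="font-size:{sizes[w]}px">{w}</span>' for w in words
    )
```

Passing in counts summed over every document instead of a single post would give the site-wide version of the same cloud.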

We have to remember that the classifier has no knowledge of the meaning of words, so to improve its output we employ a few common search engine methods.

Stop words: a list of words that should not be considered, as they are not relevant to the classification of the document e.g. (a, the, if, you, he, she). Stop words are sometimes called noise words.

Custom words: Non-dictionary words that you want to be included in classification e.g. (Umbraco, C#, Darren).

Significant words: Words that you want to add weight to, in my case Umbraco, Pho and words that relate to these topics.

You can see all of these in the updated config file. Experience has taught me to also exclude single letter words and numeric strings.

The NClassifier word list is then passed through NetSpell and sorted by number of occurrences to give the output that you see.

Finally, although useful, these classifiers need a lot of training to give good results. As you can probably see, you will end up adding more and more words to your list of stop words, and gradually the summarization becomes more concise. To work around the need to train your classifiers, you can consider only significant words in the output.
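The filtering steps described above can be sketched like this. It is a rough illustration in Python, not the NClassifier or NetSpell API: the word lists are made-up stand-ins for the config file, and `DICTIONARY` is a hypothetical replacement for a real spell-checker lookup.

```python
import re
from collections import Counter

# Hypothetical word lists; the real ones live in the config file.
STOP_WORDS = {"a", "the", "if", "you", "he", "she"}
CUSTOM_WORDS = {"umbraco", "c#", "darren"}
SIGNIFICANT_WORDS = {"umbraco", "pho"}
# Stand-in for a spell-checker dictionary lookup (an assumption).
DICTIONARY = {"classifier", "post", "word", "words", "tag"}


def is_candidate(token):
    """Apply the filters to one lowercase token."""
    if len(token) < 2 or token.isdigit():
        return False  # single-letter words and numeric strings
    if token in STOP_WORDS:
        return False
    # Keep only dictionary words plus the custom non-dictionary words.
    return token in CUSTOM_WORDS or token in DICTIONARY


def ranked_words(text, significant_only=False):
    """Return (word, count) pairs sorted by number of occurrences."""
    tokens = re.findall(r"[a-z0-9#]+", text.lower())
    counts = Counter(t for t in tokens if is_candidate(t))
    if significant_only:
        counts = Counter(
            {w: c for w, c in counts.items() if w in SIGNIFICANT_WORDS}
        )
    return counts.most_common()
```

With `significant_only=True` the output ignores everything except the significant words, which avoids growing the stop word list at the cost of missing topics you did not anticipate.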

This example is very simple: significant words are not weighted, so the only factor that determines the importance of a word in a document is the number of times that it occurs. Weighting can be very complex, so I'll leave it for now.

As with my first NClassifier example, I think it would be best to employ this as an ActionHandler if being used within Umbraco. To employ site-wide word counts you would need a data store with the columns documentId, wordId, occurenceCount so that you could easily update your information upon re-publish.
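One possible shape for that data store, sketched with Python's built-in sqlite3 module. The three column names come from the paragraph above; the table names, the separate `words` lookup table and the function names are assumptions for the sake of the example, and a real Umbraco site would more likely use its existing SQL Server database.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the two tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "wordId INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wordCounts ("
        "documentId INTEGER NOT NULL, "
        "wordId INTEGER NOT NULL REFERENCES words(wordId), "
        "occurenceCount INTEGER NOT NULL, "
        "PRIMARY KEY (documentId, wordId))"
    )
    return conn


def republish(conn, document_id, counts):
    """Replace all counts for one document, as a publish handler might."""
    with conn:  # commit everything together, or roll back on error
        conn.execute(
            "DELETE FROM wordCounts WHERE documentId = ?", (document_id,)
        )
        for word, count in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO words (word) VALUES (?)", (word,)
            )
            (word_id,) = conn.execute(
                "SELECT wordId FROM words WHERE word = ?", (word,)
            ).fetchone()
            conn.execute(
                "INSERT INTO wordCounts VALUES (?, ?, ?)",
                (document_id, word_id, count),
            )


def site_wide_counts(conn):
    """Total occurrences of each word across all documents."""
    return conn.execute(
        "SELECT w.word, SUM(c.occurenceCount) AS total "
        "FROM wordCounts c JOIN words w ON w.wordId = c.wordId "
        "GROUP BY w.word ORDER BY total DESC, w.word"
    ).fetchall()
```

Deleting and re-inserting a document's rows on every publish keeps the update logic trivial: there is no need to work out which words were added or removed by an edit.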

Thursday, September 21, 2006

NClassifier: Auto tag posts

Filed under: Umbraco, Work - by Darren Ferguson @ 4:40 PM

A while back I read Ismail's post on NClassifier. I've always been interested in content intelligence and decided to look into auto tagging blog posts and creating document summaries.

I'm not new to this game and I still feel that humans do a much better job of summarising text than machines. However, the tagging using the BayesianClassifier works quite well.

I've put together a very simple demo which classifies blog posts based on this simple XML configuration. If you click through to the detail view of any blog post, you'll see the result of the classification just before the comments.
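For anyone curious about what a Bayesian classifier does under the hood, here is a minimal naive Bayes sketch in Python. It is not NClassifier's API or implementation; the class name, the training texts and the add-one smoothing are all choices made for this illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> number of training texts
        self.vocabulary = set()

    def train(self, tag, text):
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-probability score per tag (higher is a better match)."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for tag, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[tag] / total_docs)
            for word in words:
                # Add-one smoothing so an unseen word does not zero the score.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            result[tag] = score
        return result

    def classify(self, text):
        """Return the best-scoring tag; needs at least one trained tag."""
        scores = self.scores(text)
        return max(scores, key=scores.get)
```

Train it with a few example texts per tag, for instance `tagger.train("Pho", "pho noodle soup chilli herbs stock")`, then call `tagger.classify(post_text)` to get the most likely tag for a new post.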

Ultimately, I'd probably implement this as an Umbraco ActionHandler which sets metadata against items as they are published, but for now classification runs on each request.

If you are wondering what the purpose of this is, think about the technology behind Amazon recommendations and the numerous sites that implement 'you may be interested in this' functionality. Once your content is classified, it isn't too hard to track which tags your users are looking at most and assemble custom content for them.
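A very rough sketch of that tracking idea, in Python. It bears no relation to how Amazon actually does it; the class, the method names and the scoring rule (sum of view counts for an item's tags) are all assumptions for illustration.

```python
from collections import Counter


class TagInterest:
    """Tracks how often one visitor has viewed content carrying each tag."""

    def __init__(self):
        self.views = Counter()

    def record_view(self, tags):
        """Call once per page view with the tags of the viewed item."""
        self.views.update(tags)

    def recommend(self, catalogue, seen=(), limit=3):
        """Rank unseen items by the summed view counts of their tags."""
        scored = [
            (sum(self.views[tag] for tag in tags), title)
            for title, tags in catalogue.items()
            if title not in seen
        ]
        # Drop items with no overlap, then sort best first (ties by title).
        scored = [(score, title) for score, title in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [title for _, title in scored[:limit]]
```

Here `catalogue` is a plain mapping of item title to its list of tags, which is exactly the metadata the classification step produces.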

Thursday, September 21, 2006

Integrated spell checker for Umbraco

Filed under: Umbraco - by Darren Ferguson @ 11:29 AM

I'd avoided looking at spell checking in Umbraco as I thought it would be quite time consuming.

I was wrong... Thanks to the excellent NetSpell and its built-in web UI demo, it only took about 30 minutes to hack something together.

Check out the screencast.

The checker works on dictionary files, so it will do languages other than English. I implemented it to check specific Umbraco fields, though it would be relatively simple to have it check all fields in a document type.

I'm hoping Niels will snap this one up and integrate it, as I'm not known for my UI design!

Next on the list is looking at NClassifier which Ismail found. I'm hoping that it will help me add some content intelligence to Umbraco - for example, auto suggesting keywords and document summaries.

Wednesday, September 20, 2006

Updated RSS feed URL

Filed under: Umbraco, Misc - by Darren Ferguson @ 5:30 PM

My RSS feed is now at http://www.darren-ferguson.com/?altTemplate=RSS

As those of you who subscribe have probably realised, there is a mysterious bug somewhere in the RSS extension for Umbraco that causes duplicate posts to be downloaded to RSS clients.

This new implementation should fix the issue.

Tuesday, September 19, 2006

Personal/Non-profit license for Umbraco backup

Filed under: Umbraco, Umbraco Backup - by Darren Ferguson @ 9:24 AM

I have added a Personal/Non-profit license to Umbraco backup.

This license allows you to use all of the features of Umbraco Backup for £15 (US$28 or €22).

In order to take advantage of this license, your site must be a personal website, a registered charity or a non-profit organisation.

Monday, September 18, 2006

FergusonMoriyama.com: built with Umbraco

Filed under: Umbraco, Work - by Darren Ferguson @ 1:07 PM

We finally got around to building the new website for my company at fergusonmoriyama.com.

The branding was developed by Yacada and also includes our stationery - you can see the stationery on flickr, but my scanner is pretty poor quality.

There is still quite a lot to come in terms of content, but we are all delighted with the results.

Saturday, September 16, 2006

Any Umbraco people on LinkedIn?

Filed under: Umbraco, Misc - by Darren Ferguson @ 12:14 PM

Are any of you Umbraco people on LinkedIn? Somebody just introduced me to it and it looks very useful.

Although I feel quite ridiculous saying this... drop me an email and we'll 'connect'.

Tuesday, September 12, 2006

Solution: Help DRM and forward lock

Filed under: Work - by Darren Ferguson @ 4:47 PM

I had a few mails asking if I resolved the forward lock and C# problem that I was having. The working code is below. I am not sure if it is the best way to do this, but it does work.

Tuesday, September 12, 2006

Reporting dead links in Multiple content picker

Filed under: Umbraco - by Darren Ferguson @ 3:42 PM

The Umbraco multi content picker is a wonderful little control, but can sometimes become a bit of a maintenance nightmare. The example below gives you an idea of how to implement dead link notification within XSLT.

The other way to deal with this would be to have an action handler checking document integrity each time a publish occurred, but this is a nice quick fix if you have a relatively small amount of content.

I store my email addresses as Umbraco dictionary items, but you could always hard code them in your XSLT.

Tuesday, September 12, 2006

Thanh Binh four chilli rated

Filed under: Pho, Bun Bo Hue - by Darren Ferguson @ 9:12 AM

Visited: 10/09/2006

Thanh Binh (Camden) - Four chilli rated

http://www.london-eating.co.uk/3257.htm
14 Chalk Farm Road, London, NW1 8AG

Nice cafe style place in Camden. Surprisingly quiet despite the mayhem of the market outside. They only had Pho (no Hue style), though the chilli sauce soon resolved that.

Good for: Very tasty stock and fresh herbs.
Bad for: Parking - we got a £50 fine on a Sunday (always read the signs).

Darren Ferguson - Umbraco, Dev, Pho is published with Umbraco