Technorati blog claim post
In order to confirm that I really own my blog, Technorati need me to make a post with a link to my Technorati Profile.
As you may be aware, many blog directories let you notify them of updates to your blog through an XML-RPC interface.
If you have already installed the Umbraco blog package, you may not be aware that it uses this method to notify technorati.com. Go to the site and search for some of your blog content and it should be there.
This is a nice piece of functionality that can drive traffic to your site. However, there are hundreds of blog directories out there, and you can't tell Umbraco's blog package to ping additional URLs.
To get around this, I created an Umbraco action handler. When a blog post is published - or re-published - it reads a comma-delimited list of URLs from a key called fmBlogPingUrls and pings each one. The pinging is done on a background thread so that the GUI doesn't have to wait for all of the ping requests to be processed.
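The flow described above can be sketched roughly as follows. This is an illustration in Python, not the handler's actual .NET source. It assumes the directories accept the common weblogUpdates.ping XML-RPC call, and the function names, the injectable `send` parameter and the example URLs are all made up for the sketch.

```python
import threading
import xmlrpc.client


def parse_ping_urls(setting):
    """Split a comma-delimited setting value into a clean list of URLs."""
    return [url.strip() for url in setting.split(",") if url.strip()]


def build_ping_request(blog_name, blog_url):
    """Build the XML body of a weblogUpdates.ping call (assumed method name)."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def send_ping(ping_url, blog_name, blog_url):
    """Send one ping over the network and return the server's response."""
    server = xmlrpc.client.ServerProxy(ping_url)
    return server.weblogUpdates.ping(blog_name, blog_url)


def ping_all(ping_urls, blog_name, blog_url, send=send_ping, log=print):
    """Ping every URL in turn, logging failures instead of raising."""
    for ping_url in ping_urls:
        try:
            log(f"{ping_url}: {send(ping_url, blog_name, blog_url)}")
        except Exception as error:  # one dead directory should not stop the rest
            log(f"{ping_url}: failed ({error})")


def ping_in_background(ping_urls, blog_name, blog_url, send=send_ping, log=print):
    """Start the pings on a daemon thread so the caller returns immediately."""
    worker = threading.Thread(
        target=ping_all,
        args=(ping_urls, blog_name, blog_url, send, log),
        daemon=True,
    )
    worker.start()
    return worker
```

Passing `send` and `log` as parameters keeps the sketch testable without touching the network; a publish event would simply call `ping_in_background(parse_ping_urls(setting), name, url)`.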
If your Umbraco blog is a standard install, you can just drop the DLL into your bin directory and add the previously mentioned key into your web.config file to get started.
If you have changed document type names or property names then you will need to open up the source and modify a few constants before building. The action places quite a lot of debug information about the ping results in the Umbraco log. You may wish to disable this.
Finally, you'll need some URLs to ping:
http://elliottback.com/wp/archives/2004/11/21/a-list-of-rpc-and-rpc2-to-ping/
This content intelligence thing seems to be turning into a bit of a mini series.
The last addition to this example for the foreseeable future is Stemming - the reduction of a word to its root form.
In case you are thinking WTF, think about the relationship between training and train, discussion and discuss and so on. If you want to read more about how machines can perform stemming see http://www.comp.lancs.ac.uk/computing/research/stemming/general/ but personally, I am happy to be aware of stemming algorithms and the fact that they can be useful when classifying documents.
If you click through to the detail view of any of my blog posts you'll see that some words have been stemmed in the classification output - the root form appears in brackets after the word.
After playing with NClassifier's PorterStemmer, I found it was often too harsh, reducing words so far that they made no sense.
As a solution, I applied both the PorterStemmer and the KStemmer, which is available from the lucene.net site, and selected the shortest correctly spelled result as the root word. The results are pretty satisfactory, but there will be mistakes - 'use' is stemmed to 'us', for example.
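The selection rule is simple enough to sketch in a few lines of Python. The two stemmers below are crude suffix-strippers invented purely for illustration (they are not the Porter or K-stem algorithms), and the dictionary is a tiny hand-made set standing in for a real spell checker.

```python
def harsh_stemmer(word):
    """Toy stand-in for an aggressive stemmer: strips several suffixes."""
    for suffix in ("ing", "ion", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def gentle_stemmer(word):
    """Toy stand-in for a conservative stemmer: strips only 'ing' and 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def choose_root(word, stemmers, dictionary):
    """Return the shortest correctly spelled stem, or the word itself if none qualifies."""
    candidates = [stem(word) for stem in stemmers]
    valid = [candidate for candidate in candidates if candidate in dictionary]
    return min(valid, key=len) if valid else word


# Hypothetical dictionary; a real one would come from the spell checker.
DICTIONARY = {"train", "discuss", "discussion", "use", "us"}
STEMMERS = (harsh_stemmer, gentle_stemmer)

print(choose_root("training", STEMMERS, DICTIONARY))    # train
print(choose_root("discussion", STEMMERS, DICTIONARY))  # discuss
print(choose_root("use", STEMMERS, DICTIONARY))         # us (the kind of mistake noted above)
```

The last call shows why the rule can misfire: "us" is a perfectly valid word, so the "shortest correctly spelled" test happily accepts it.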
So why is this useful? Imagine someone visits your site and searches for the word 'classification'. Using stemming you can reduce classification to classify, determine that 'classifier' has the same root word and return documents matching classify, classification and classifiers.
Read it a few times, it does make sense. Honest!
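If it still doesn't, here is the search idea as a small Python sketch. The root lookup is a hand-written table rather than a real stemmer, and the documents are made up, so treat it as an illustration of indexing by root word and nothing more.

```python
from collections import defaultdict

# Hypothetical root table; a real system would get these from a stemmer.
ROOTS = {
    "classification": "classify",
    "classifier": "classify",
    "classifiers": "classify",
    "classify": "classify",
}


def root_of(word):
    """Lower-case a word and map it to its root, defaulting to the word itself."""
    word = word.lower()
    return ROOTS.get(word, word)


def build_index(documents):
    """Map each root word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[root_of(word.strip(".,!?'\""))].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents sharing the query's root word."""
    return sorted(index.get(root_of(query), set()))


documents = {
    1: "Training a classifier",
    2: "Document classification explained.",
    3: "Spell checking in Umbraco",
}
index = build_index(documents)
print(search(index, "classification"))  # [1, 2]
```

Because both the index and the query are reduced to the same root, a search for 'classification' also finds the post that only mentions 'classifier'.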
So now what? Well, I'll shelve the content intelligence thing for the moment. As with most of my prototyping, I get to the stage where I could sell it as a product if I found a client to fund it, and then I park it.
If I have some free time I may well try and move all of this prototype functionality into an ActionHandler and then add some personalisation for members who are logged on to my site.
Note: When reading this post, click through to the detail view so that you can see the output that is being discussed.
Ever seen pages like your flickr tag list that use the pretty cool effect of displaying the tag in a size that relates to its frequency?
As well as classifying your posts, NClassifier makes it easy to get information on how often words occur in a piece of text. If you click through to the detail page of any blog post, you'll see that yesterday's classification now lists how often each word occurs in the post. It would be a relatively easy job to randomize the order in which the words appear and apply a font size based on how many times they occur in a document, or even across an entire website.
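A minimal Python sketch of that tag cloud idea follows. The linear scaling and the 12 to 32 pixel range are arbitrary choices for the example, not something the classifier provides.

```python
import random


def font_sizes(counts, min_px=12, max_px=32):
    """Scale each word's count linearly into the range min_px..max_px."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    spread = high - low
    sizes = {}
    for word, count in counts.items():
        # When every word has the same count, everything gets the minimum size.
        scale = (count - low) / spread if spread else 0.0
        sizes[word] = round(min_px + scale * (max_px - min_px))
    return sizes


def shuffled_cloud(counts, seed=None):
    """Return (word, font size) pairs in random order, ready to render."""
    sizes = font_sizes(counts)
    words = list(sizes)
    random.Random(seed).shuffle(words)
    return [(word, sizes[word]) for word in words]


print(font_sizes({"umbraco": 9, "blog": 5, "pho": 1}))
# {'umbraco': 32, 'blog': 22, 'pho': 12}
```

For site-wide clouds the same function works unchanged; only the counts passed in would be totals across all documents.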
We have to remember that the classifier has no knowledge of the meaning of words, so to improve its output we employ a few common search engine methods.
Stop words: a list of words that should not be considered because they are not relevant to the classification of the document e.g. (a, the, if, you, he, she). Stop words are sometimes called noise words.
Custom words: Non-dictionary words that you want to be included in classification e.g. (Umbraco, c#, Darren).
Significant words: Words that you want to add weight to, in my case Umbraco, Pho and words that relate to these topics.
You can see all of these in the updated config file. Experience has taught me to also exclude single letter words and numeric strings.
The NClassifier word list is then passed through NetSpell and sorted by number of occurrences to give the output that you see.
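Put together, the filtering steps look something like the Python sketch below. The tokenising regular expression is an assumption, and a plain set of known words stands in for the NetSpell dictionary check; the word lists are just the examples given above.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "if", "you", "he", "she"}
CUSTOM_WORDS = {"umbraco", "c#", "darren"}


def count_words(text, dictionary, stop_words=STOP_WORDS, custom_words=CUSTOM_WORDS):
    """Count the words worth keeping and return them most frequent first."""
    tokens = re.findall(r"[a-z0-9#']+", text.lower())
    counts = Counter()
    for token in tokens:
        if len(token) == 1 or token.isdigit():
            continue  # skip single-letter words and numeric strings
        if token in stop_words:
            continue  # skip noise words
        if token in dictionary or token in custom_words:
            counts[token] += 1  # keep only correctly spelled or custom words
    return counts.most_common()


# Hypothetical dictionary standing in for the spell checker.
dictionary = {"blog", "post", "and", "test"}
text = "The blog post: Umbraco and the Umbraco blog, 2008, a xyzzy c# test"
print(count_words(text, dictionary))
```

Here 'xyzzy' is dropped because it is neither in the dictionary nor a custom word, while 'c#' survives only because it is listed as a custom word.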
Finally, although these classifiers are useful, they need a lot of training to give good results. As you can probably see, you will end up adding more and more words to your list of stop words, and gradually the summarization becomes more concise. To work around the need to train your classifiers, you can consider only significant words in the output.
This example is very simple: significant words are not weighted, so the only factor that determines the importance of a word in a document is the number of times that it occurs. Weighting can be very complex, so I'll leave it for now.
As with my first NClassifier example, I think it would be best to employ this as an ActionHandler if being used within Umbraco. To employ site-wide word counts you would need a data store with the columns documentId, wordId, occurenceCount so that you could easily update your information upon re-publish.
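As a rough sketch of such a store, here is the three-column table in SQLite via Python. The table name wordCount, the composite key and the delete-then-insert approach to re-publishing are assumptions for the example; only the column names come from the description above.

```python
import sqlite3


def create_store(connection):
    """Create the word count table if it does not already exist."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS wordCount ("
        "documentId INTEGER NOT NULL, "
        "wordId INTEGER NOT NULL, "
        "occurenceCount INTEGER NOT NULL, "
        "PRIMARY KEY (documentId, wordId))"
    )


def republish(connection, document_id, word_counts):
    """Replace all stored counts for one document in a single transaction."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "DELETE FROM wordCount WHERE documentId = ?", (document_id,)
        )
        connection.executemany(
            "INSERT INTO wordCount VALUES (?, ?, ?)",
            [(document_id, word_id, count) for word_id, count in word_counts.items()],
        )


def site_totals(connection):
    """Return {wordId: total occurrences across every document}."""
    rows = connection.execute(
        "SELECT wordId, SUM(occurenceCount) FROM wordCount "
        "GROUP BY wordId ORDER BY wordId"
    )
    return dict(rows.fetchall())
```

Keying on documentId is what makes re-publishing cheap: the old rows for that one document are thrown away and replaced, and the site-wide totals fall out of a single GROUP BY.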
A while back I read Ismail's post on NClassifier. I've always been interested in content intelligence and decided to look into auto tagging blog posts and creating document summaries.
I'm not new to this game and I still feel that humans do a much better job of summarising text than machines. However, the tagging using the BayesianClassifier works quite well.
I've put together a very simple demo which classifies blog posts based on this simple xml configuration. If you click through to the detail view of any blog post, you'll see the result of the classification just before the comments.
Ultimately, I'd probably implement this as an Umbraco actionhandler which sets metadata against items as they are published, but for now classification runs on each request.
If you are wondering what the purpose of this is, think about the technology behind Amazon recommendations and the numerous sites that implement 'you may be interested in this' functionality. Once your content is classified, it isn't too hard to track which tags your users are looking at most and assemble custom content for them.
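To make the tracking idea concrete, here is a deliberately naive Python sketch. The TagTracker class, its scoring rule (sum of view counts for a post's tags) and the sample posts are all invented for illustration; real recommendation systems are far more involved.

```python
from collections import Counter


class TagTracker:
    """Counts how often one visitor views posts carrying each tag."""

    def __init__(self):
        self.views = Counter()

    def record_view(self, post_tags):
        """Call this whenever the visitor opens a post with these tags."""
        self.views.update(post_tags)

    def top_tags(self, n=3):
        """Return the visitor's n most viewed tags."""
        return [tag for tag, _ in self.views.most_common(n)]

    def recommend(self, posts, n=3):
        """Rank posts by the total views of their tags, dropping zero scores."""
        def score(tags):
            return sum(self.views[tag] for tag in tags)

        ranked = sorted(posts.items(), key=lambda item: score(item[1]), reverse=True)
        return [title for title, tags in ranked[:n] if score(tags) > 0]


tracker = TagTracker()
tracker.record_view(["umbraco", "backup"])
tracker.record_view(["umbraco"])

posts = {
    "Spell checking": ["umbraco", "netspell"],
    "Pho in London": ["pho"],
    "Backup license": ["backup"],
}
print(tracker.recommend(posts))  # ['Spell checking', 'Backup license']
```

One tracker per logged-in member would be enough to start assembling custom content.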
I'd avoided looking at spell checking in Umbraco as I thought it would be quite time consuming.
I was wrong... Thanks to the excellent NetSpell and its built-in web UI demo, it only took about 30 minutes to hack something together.
Check out the screencast.
The checker works on dictionary files, so it will do languages other than English. I implemented it to check specific Umbraco fields, though it would be relatively simple to have it check all fields in a document type.
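The field-by-field approach can be sketched in Python as below. This is not NetSpell's API: the dictionary is assumed to be a simplified one-word-per-line list, and the field names (title, bodyText) are hypothetical.

```python
import re


def load_dictionary(lines):
    """Build a word set from dictionary lines, one word per line (simplified format)."""
    return {line.strip().lower() for line in lines if line.strip()}


def check_fields(document, field_names, dictionary):
    """Return {field: [unknown words]} for the chosen fields only."""
    problems = {}
    for name in field_names:
        # Match runs of letters in any language, allowing one inner apostrophe.
        words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", document.get(name, ""))
        unknown = [word for word in words if word.lower() not in dictionary]
        if unknown:
            problems[name] = unknown
    return problems


dictionary = load_dictionary(["spell\n", "checking\n", "is\n", "quick\n", ""])
document = {
    "title": "Spell checking",
    "bodyText": "Spell chekcing is quick",
    "notes": "zzz",
}
print(check_fields(document, ["title", "bodyText"], dictionary))
# {'bodyText': ['chekcing']}
```

Checking every field in a document type would just mean passing all of the document's keys instead of a hand-picked list, and swapping the dictionary lines switches the language.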
I'm hoping Niels will snap this one up and integrate it, as I'm not known for my UI design!
Next on the list is looking at NClassifier which Ismail found. I'm hoping that it will help me add some content intelligence to Umbraco - for example, auto suggesting keywords and document summaries.
My RSS feed is now at http://www.darren-ferguson.com/?altTemplate=RSS
As those of you who subscribe have probably realised, there is a mysterious bug somewhere in the RSS extension for Umbraco that causes duplicate posts to be downloaded to RSS clients.
This new implementation should fix the issue.
I have added a Personal/Non-profit license to Umbraco backup.
This license allows you to use all of the features of Umbraco Backup for £15 (US$28 or €22).
In order to take advantage of this license, your site must be a personal website, a registered charity or a non-profit organisation.
We finally got around to building the new website for my company at fergusonmoriyama.com
The branding was developed by Yacada and also includes our stationery - you can see the stationery on flickr, but my scanner is pretty poor quality.
There is still quite a lot to come in terms of content, but we are all delighted with the results.
Are any of you Umbraco people on LinkedIn? Somebody just introduced me to it and it looks very useful.
Although I feel quite ridiculous saying this... Drop me an email and we'll 'connect'.
Darren Ferguson - Umbraco, Dev, Pho is published with Umbraco 3.0.5