//
you're reading...
machine learning, mahout, statistics, Web Scraping and APIs

Book Recommendations from Beyond the Grave: A Mahout Example

In H P Lovecraft’s The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrection of the dead would be a bit of an over claim for machine learning but we can get some of the way: we can bring them back for book recommendations!

The Case of Charles Dexter Ward

It’s a grandiose claim and you can judge for yourself how successful it is. Really I just wanted an opportunity to experiment with Apache Mahout and to brush up on my Java. I thought I’d share the results.

Apache Mahout is a scalable machine learning library written in Java (see previous post). The recommender part of the library takes lists of users and items (for example it could be all users on a website, the books they purchased and perhaps even their ratings of those books). Then for any given user, say Smith, it works out the users that are most similar to that user and compiles a list of recommendations based on their purchases (ignoring of course anything Smith already has). It can get more complicated but that’s the gist of it.

So all we need to supply the algorithm with is a list of users and items.

Casting about for a test data set I thought it might be interesting to take the influencers data set that I extracted from Wikipedia for a previous post. This data set is a list of who influenced who across all the famous people covered on Wikipedia. For example the list of influences for T. S. Elliot is: Dante Alighieri, Matthew Arnold, William Shakespeare, Jules Laforgue, John Donne, T. E. Hulme, Ezra Pound, Charles Maurras and Charles Baudelaire.

Why not use x is influencing y as a proxy for y has read x? Probably true in a lot of cases. We can imagine that the famous dead have been shopping on Amazon for their favourite authors and we now have the database. Even better we can tap into their likes and dislikes to give us some personalised recommendations.

It works (sort of) and was very interesting to put together. Here’s a demo:

Say you liked English Romantic poets. Then you might submit the names:

http://simpleweb-deadrecommender.rhcloud.com/?names=john keats, william wordsworth, samuel taylor coleridge

Click on the link to see what the web app returns. You’ll find that you need to hit refresh on your browser whenever you submit something. Not sure why at the moment.

Or you might try some painters you like:

http://simpleweb-deadrecommender.rhcloud.com/?names=Pablo Picasso, Henri Matisse, Paul Gauguin

Or even just …

http://simpleweb-deadrecommender.rhcloud.com/?names=Bob Dylan

Or you could upset things a bit by adding in someone incongruous!

http://simpleweb-deadrecommender.rhcloud.com/?names=john keats, william wordsworth, samuel taylor coleridge, arthur c clarke

Have a go. All you need to do is add some names separated by commas after the names= part in the http request. Note although my code does some basic things to facilitate matching, the names will have to roughly match what is in wikipedia so if you’re getting an error just look the person up there.

What are we getting back?

The first list shows, in Mahout terminology, your user neighborhood, that is individuals in wikipedia who are judged to have similar tastes to you. I have limited the output to 20 though in fact the recommender is set to use 80. I will explain this in more detail in later posts. They are not in rank order (I haven’t figured out how to do this yet.)

The second list is the recommendations derived from the preferences of your ‘neighbours’. Again I’ll explain this in more detail later (or you could look at Mahout in Action where it is explained nicely)

The code is available on github

Sometimes the app goes dead for some reason and I have to restart it. If that happens just drop me a line and I’ll restart it.

I plan to do several more posts on how this was done. At the moment I’m thinking

  • A walk though of the code
  • An explanation of how the recommender was tuned
  • How it was deployed as a web app on Open Shift

But if anyone out there is interested in other things or has questions just let me know.

Simon

About these ads

About Simon Raper

I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, spark, Hadoop), interactive visualisatons (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310 K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog Stats

  • 315,929 hits

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 508 other followers

Follow

Get every new post delivered to your Inbox.

Join 508 other followers

%d bloggers like this: