I am an RSS-accredited statistician with over 15 years’ experience in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. As Data Scientist at Channel 4, my role is to develop machine learning solutions that allow the channel to build a deeper relationship with the viewer and innovate the way advertising is traded, and to support the creative side of the business. My current interests are scalable machine learning (Mahout, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Mindshare, News International, Credit Suisse and AOL. I am co-author, with Mark Bulling, of Drunks and Lampposts, a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 270K visits and was mentioned in Flowing Data, io9, and the online editions of The New York Times and The New Yorker.
simonraper has written 23 posts for Drunks&Lampposts

Buster – a new R package for bagging hierarchical clustering

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could … Continue reading

Picturing the output of a neural net

Another one from my new site, coppelia.io. Some time ago, during a training session, a colleague asked me what a surface plot of a two-input neural net would look like. That is, if you have two inputs x_1 and x_2 and plot the output y as a surface, what do you get? We thought … Continue reading
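To make the question concrete, here is a minimal Python sketch that evaluates a tiny two-input net over a grid of (x_1, x_2) values, which is the data a surface plot would be drawn from. It is not the code from the post. The tanh hidden units, the linear output unit, the weights and the name `net_output` are all assumptions chosen for illustration.

```python
import math

def net_output(x1, x2, hidden, output):
    """Forward pass of a one-hidden-layer net: tanh hidden units, linear output."""
    # Each hidden unit is a hypothetical (w1, w2, bias) triple.
    activations = [math.tanh(w1 * x1 + w2 * x2 + b) for w1, w2, b in hidden]
    # The output unit holds one weight per hidden unit, then a bias.
    *out_weights, out_bias = output
    return sum(w * a for w, a in zip(out_weights, activations)) + out_bias

# Illustrative weights only; a trained net would supply its own.
hidden = [(1.0, -1.0, 0.0), (0.5, 0.5, -1.0)]
output = (1.0, -2.0, 0.5)

# Evaluate y on a 9 x 9 grid covering -2.0 to 2.0 in both inputs.
grid = [i / 2 for i in range(-4, 5)]
surface = [[net_output(x1, x2, hidden, output) for x1 in grid] for x2 in grid]
```

Each row of `surface` holds the outputs for one value of x_2, so it can be passed straight to any surface or contour plotting routine.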

The Analyst’s Toolbox

Another post from my new website, coppelia.io. There are hundreds, maybe thousands, of open source/free/online tools out there that form part of the analyst’s toolbox. Here’s what I have on my Mac for day-to-day work. Read more..

Quick start Hadoop and Hive for analysts

Here’s another post from my new website, coppelia.io. The problem: You have a huge data set to crunch in a small amount of time. You’ve heard about the advantages of map reduce for processing data in parallel and you’d quite like to have a go with Hadoop to see what all the fuss is about. … Continue reading

Include uncertainty in a financial model

Here’s a post that appears on my new website, coppelia.io. The problem: You’ve been asked to calculate some figure or other (e.g. end of year revenue, average customer lifetime value) based on numbers supplied from various parts of the business. You know how to make the calculation but what bothers you is that some of … Continue reading

Freehand Diagrams with Adobe Ideas

Freehand diagrams have two big virtues: they are quick and they are unconstrained. I used to use a notebook (see What are degrees of freedom) but recently I got an iPad and then I found Adobe Ideas. It’s completely free and has just the right level of complexity for getting ideas down fast. It takes … Continue reading

A confused tangle

A confusion matrix is a confusing thing. There’s a surprising number of useful statistics that can be built out of just four numbers and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative or one that is … Continue reading
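As a small illustration of how much can be built from those four numbers, the Python sketch below derives several standard rates from the cells of a binary confusion matrix. It uses the usual convention that "true"/"false" says whether the classification was correct and "positive"/"negative" says what was predicted, so a true negative is an observation outside the class that was classified negative. The function name `confusion_stats` and the example counts are illustrative assumptions, not taken from the post.

```python
def confusion_stats(tp, fp, fn, tn):
    """Derive common statistics from the four cells of a binary confusion matrix.

    tp: in the class, classified positive      fp: not in the class, classified positive
    fn: in the class, classified negative      tn: not in the class, classified negative
    Assumes every denominator below is non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,       # share of all observations classified correctly
        "precision": tp / (tp + fp),         # share of positive calls that are right
        "recall": tp / (tp + fn),            # sensitivity, or true positive rate
        "specificity": tn / (tn + fp),       # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),   # harmonic mean of precision and recall
    }

# Made-up counts, purely for illustration.
stats = confusion_stats(tp=40, fp=10, fn=20, tn=30)
```

Note that precision and recall share a numerator but not a denominator, which is one reason the links between these statistics are easy to lose track of.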

Box Me

Here’s a short R function I wrote to turn a long data set into a wide one for viewing. It’s not the most exciting function ever but I find it quite useful when my screen is wide and short. It simply cuts the data set horizontally into equal size pieces and puts them side by … Continue reading
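The idea described here can be sketched in a few lines. The version below is in Python rather than R and is not the function from the post: the name `box_me`, its arguments and the padding behaviour are assumptions. It cuts a list of rows into a given number of equal-height pieces and lays the pieces next to each other, padding the last piece when the rows do not divide evenly.

```python
def box_me(rows, n_pieces, filler=""):
    """Cut `rows` (a list of equal-length lists) into n_pieces blocks of equal
    height and join the blocks left to right into one wider table."""
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    height = -(-len(rows) // n_pieces)      # ceiling division: rows per piece
    width = len(rows[0]) if rows else 0     # columns in the original table
    wide = []
    for i in range(height):
        combined = []
        for piece in range(n_pieces):
            idx = piece * height + i
            # Pad with filler cells once the original rows run out.
            combined.extend(rows[idx] if idx < len(rows) else [filler] * width)
        wide.append(combined)
    return wide

# Six long rows of (n, n squared) become two wide rows of three pieces each.
long_data = [[i, i * i] for i in range(1, 7)]
wide_data = box_me(long_data, 3)
```

Reading down the first two columns of `wide_data`, then the next two, and so on, recovers the original row order.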

Deploying your Mahout application as a webapp on OpenShift

OpenShift (a cloud computing platform from Red Hat) does not at present support Hadoop so this is not a route to go down if you have the kind of data that requires map reduce. However it’s not a bad option if you’re just playing with Mahout (see the previous post) and would like to share what … Continue reading

Visualising Shrinkage

A useful property of mixed effects and Bayesian hierarchical models is that lower-level estimates are shrunk towards the more stable estimates further up the hierarchy. To use a time-honoured example, you might be modelling the effect of a new teaching method on performance at the classroom level. Classes of 30 or so students … Continue reading
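The shrinkage effect can be shown with the textbook partial-pooling formula for a simple normal hierarchical model. In this sketch the within-class and between-class variances are assumed known, and the function name and all numbers are made up for illustration. It is not the model or code from the post. The shrunken estimate is a weighted average of the class mean and the grand mean, with more weight on the class mean as the class gets larger.

```python
def shrunken_mean(class_mean, n, grand_mean, within_var, between_var):
    """Partial-pooling estimate of one class mean with known variances.

    weight = n / (n + within_var / between_var), so small or noisy classes
    are pulled more strongly towards the grand mean.
    """
    weight = n / (n + within_var / between_var)
    return weight * class_mean + (1 - weight) * grand_mean

# Two hypothetical classes with the same raw mean but different sizes.
large_class = shrunken_mean(class_mean=80.0, n=30, grand_mean=70.0,
                            within_var=100.0, between_var=25.0)
small_class = shrunken_mean(class_mean=80.0, n=5, grand_mean=70.0,
                            within_var=100.0, between_var=25.0)
```

With these inputs the class of 30 keeps most of its raw mean, while the class of 5 moves noticeably further towards the grand mean of 70.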
