archives

simonraper

I am an RSS-accredited statistician with over 15 years’ experience in data mining and analytics, and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. As Data Scientist at Channel 4, my role is to develop machine learning solutions that allow the channel to build a deeper relationship with the viewer, to innovate in the way advertising is traded, and to support the creative side of the business. My current interests are scalable machine learning (Mahout, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Mindshare, News International, Credit Suisse and AOL. With Mark Bulling, I am co-author of Drunks and Lampposts, a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 270K visits and has been mentioned in Flowing Data, io9, and the online editions of The New York Times and The New Yorker.
simonraper has written 20 posts for Drunks&Lampposts

Quick start Hadoop and Hive for analysts

Here’s another post from my new website ragscripts.com. The problem: You have a huge data set to crunch in a small amount of time. You’ve heard about the advantages of MapReduce for processing data in parallel and you’d quite like to have a go with Hadoop to see what all the fuss is about. … Continue reading

Include uncertainty in a financial model

Here’s a post that appears on my new website, ragscripts.com. Online resources for analysts are often either too general to be of practical use or too specialised to be accessible. The aim of ragscripts.com is to remedy this by providing start-to-finish directions for complex analytical tasks. The site is under construction at the … Continue reading

Freehand Diagrams with Adobe Ideas

Freehand diagrams have two big virtues: they are quick and they are unconstrained. I used to use a notebook (see What are degrees of freedom?) but recently I got an iPad and then I found Adobe Ideas. It’s completely free and has just the right level of complexity for getting ideas down fast. It takes … Continue reading

A confused tangle

A confusion matrix is a confusing thing. There’s a surprising number of useful statistics that can be built out of just four numbers and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative or one that is … Continue reading
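As a rough illustration of how several common statistics fall out of the four cells of a binary confusion matrix, here is a minimal Python sketch. The class name `ConfusionMatrix` and the example counts are invented for illustration and are not taken from the post; the definitions used are the standard textbook ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """The four counts of a binary confusion matrix (hypothetical helper)."""
    tp: int  # true positives: in the class, classified positive
    fp: int  # false positives: not in the class, classified positive
    fn: int  # false negatives: in the class, classified negative
    tn: int  # true negatives: not in the class, classified negative

    def accuracy(self) -> float:
        # Share of all observations classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        # Of everything classified positive, the share truly in the class.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Of everything truly in the class, the share classified positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Of everything not in the class, the share classified negative.
        return self.tn / (self.tn + self.fp)

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


# Made-up counts, purely for illustration.
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=30)
print(cm.accuracy(), cm.precision(), cm.recall(), cm.specificity(), cm.f1())
```

The sketch does no guarding against empty rows or columns, so a matrix with, say, no positive classifications would raise a division-by-zero error.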

Box Me

Here’s a short R function I wrote to turn a long data set into a wide one for viewing. It’s not the most exciting function ever but I find it quite useful when my screen is wide and short. It simply cuts the data set horizontally into equal size pieces and puts them side by … Continue reading
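The idea of cutting a long table into pieces and laying them side by side can be sketched as follows. This is a Python illustration of the general technique, not the R function from the post; the name `box`, its parameters and the padding behaviour are assumptions made for the example.

```python
import math


def box(rows, pieces, fill=""):
    """Cut a long list of rows into at most `pieces` consecutive chunks of
    equal height and place the chunks side by side (hypothetical helper).

    `rows` is a list of equal-length lists. When the rows do not divide
    evenly, the last chunk is padded with `fill` values.
    """
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    if not rows:
        return []

    height = math.ceil(len(rows) / pieces)  # rows per chunk
    blank = [fill] * len(rows[0])           # padding row for a short chunk
    chunks = [rows[i:i + height] for i in range(0, len(rows), height)]

    boxed = []
    for r in range(height):
        line = []
        for chunk in chunks:
            # Take row r of each chunk, or padding if the chunk is shorter.
            line.extend(chunk[r] if r < len(chunk) else blank)
        boxed.append(line)
    return boxed


long_table = [[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]
for line in box(long_table, pieces=2):
    print(line)
```

With five rows and two pieces, the first three rows form the left block and the remaining two form the right block, with one padded row at the bottom right.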

Deploying your Mahout application as a webapp on OpenShift

OpenShift (a cloud computing platform from Red Hat) does not at present support Hadoop so this is not a route to go down if you have the kind of data that requires MapReduce. However it’s not a bad option if you’re just playing with Mahout (see the previous post) and would like to share what … Continue reading

Visualising Shrinkage

A useful property of mixed effects and Bayesian hierarchical models is that lower-level estimates are shrunk towards the more stable estimates further up the hierarchy. To use a time-honoured example, you might be modelling the effect of a new teaching method on performance at the classroom level. Classes of 30 or so students … Continue reading
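To make the shrinkage idea concrete, here is a minimal Python sketch under a deliberately simple assumption: a normal-normal model in which the within-class and between-class variances are treated as known. The function name, parameter names and numbers are invented for illustration; a real mixed effects or hierarchical model would estimate the variances from the data.

```python
def shrunk_mean(class_mean, n, grand_mean, within_var, between_var):
    """Precision-weighted compromise between a class mean and the grand mean.

    Assumes a normal-normal model with known variances (an assumption made
    for this sketch). Small classes are pulled strongly towards the grand
    mean; large classes keep most of their own estimate.
    """
    data_precision = n / within_var    # information from the class itself
    prior_precision = 1 / between_var  # information from the level above
    weight = data_precision / (data_precision + prior_precision)
    return weight * class_mean + (1 - weight) * grand_mean


# Two classes with the same observed mean but different sizes (made-up data).
small_class = shrunk_mean(60, n=4, grand_mean=50, within_var=100, between_var=25)
large_class = shrunk_mean(60, n=36, grand_mean=50, within_var=100, between_var=25)
print(small_class, large_class)
```

Both classes observe a mean of 60, but the class of 4 ends up halfway back to the grand mean of 50 while the class of 36 moves only a tenth of the way.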

Book Recommendations from Beyond the Grave: A Mahout Example

In H. P. Lovecraft’s The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrection of the dead … Continue reading

What are degrees of freedom?

I remember getting frustrated as an undergraduate trying to find a straight answer to this question. The standard textbook answer is something like this: “In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.” That’s from Wikipedia but it’s fairly … Continue reading

Mahout for R Users

I have a few posts coming up on Apache Mahout so I thought it might be useful to share some notes. I came at it as primarily an R coder with some very rusty Java and C++ somewhere in the back of my head so that will be my point of reference. I’ve also included … Continue reading

