//
you're reading...
Marketing Mix Lab, R

Marketing Mix Lab: Visualising The Correlation Matrix

Following on from the previous post here is an R function for visualising correlations between the explanatory variables in your data set.

# *--------------------------------------------------------------------
# | FUNCTION: visCorrel
# | Creates an MDS plot where the distance between variables represents
# | correlation between the variables (closer=more correlated)
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change                
# |     01  |05/01/2012|Simon Raper |first version.      
# *--------------------------------------------------------------------
# | INPUTS:  dataset        A dataframe containing only the explanatory
# |                         variables. It should not contain missing 
# |                         values or variables where there is no
# |                         variation
# |          abr            The number of characters used when
# |                         abbreviating the variables. If set to zero
# |                         there is no abbreviation
# |          method         The options are 'metric' or 'ordinal'. The 
# |                         default is metric
# *--------------------------------------------------------------------
# | OUTPUTS: graph          An MDS plot where the similarity measure is
# |                         correlation between the variables.
# |
# *--------------------------------------------------------------------
# | USAGE:         vis_correl(dataset, 
# |                           abr)
# |
# *--------------------------------------------------------------------
# | DEPENDS: ggplot2, directlabels
# |
# *--------------------------------------------------------------------
# | NOTES:   For more information about MDS please see 
# |          http://en.wikipedia.org/wiki/Multidimensional_scaling
# |
# *--------------------------------------------------------------------

visCorrel<-function(dataset, abr, method="metric"){
  
  #Create correlation matrix
  cor_ts<-cor(dataset)
  n<-dim(cor_ts)[2]
  
  # Create dissimilarities
  ones<-matrix(rep(1,n^2), nrow=n)
  dis_ts<-ones-abs(cor_ts)
  
  # Do MDS
  
  
  if ( method=="ordinal"){
      
    fit <- isoMDS(dis_ts, k=2)$points
    
  } else {
    
    cmd.res <- cmdscale(dis_ts, k=2, eig=TRUE)
    eig<-cmd.res$eig
    fit<-cmd.res$points
    prop<-sum(abs(eig[1:2]))/sum(abs(eig))
    
    print(paste("Proportion of squared distances represented:", round(prop*100)))
    
    if(prop<0.5){print("Less than 50% of squared distance is represented. Consider using ordinal scaling instead")}
    
  }
    
  
  x <- fit[,1]
  y <- fit[,2]
  
  labels<-row.names(cor_ts)
  if (abr>0){labels<-substr(labels,1,abr)}
  
  mds_plot<-data.frame(labels, x, y)
  
  #Plot the results
  g<-ggplot(mds_plot, aes(x=x, y=y, colour=labels, main="MDS Plot of Correlations"))+geom_point() + coord_fixed()+ opts(title ="MDS Plot of Correlations")
  direct.label(g, first.qp)
}

# *--------------------------------------------------------------------
# * Examples
# *--------------------------------------------------------------------
# visCorrel(midwest[,4:27],10, method="classical")
# visCorrel(midwest[,4:27],10, method="ordinal")
# visCorrel(Crime[,-c(1,2,11,12)],10)

An interesting example is the North Carolina Crime data set that comes with the plm package. This has the following continuous variables:

crmrte crimes committed per person
prbarr probability of arrest
prbarr probability of arrest
prbconv probability of conviction
prbpris probability of prison sentence
avgsen average sentence, days
polpc police per capita
density people per square mile
taxpc tax revenue per capita
pctmin percentage minority in 1980
wcon weekly wage in construction
wtuc weekly wage in trns, util, commun
wtrd weekly wage in whole sales and retail trade
wfir weekly wage in finance, insurance and real estate
wser weekly wage in service industry
wmfg weekly wage in manufacturing
wfed weekly wage of federal emplyees
wsta weekly wage of state employees
wloc weekly wage of local governments employees
mix offence mix: face-to-face/other
pctymle percentage of young males

Which is then visualised (using the ordinal option – see below) as this:

The closer the variables are on the plot the more highly correlated (positively or negatively) they are. Here we can see some interesting relationships. Unsurprisingly the wage variables form a correlated group. Towards the top of the chart the method correctly identifies three variables, probability of prison sentence, police per capita and offence mix that are all correlated with one another.

The plot is useful in dealing with multicollinearity as it allows us to spot clusters of correlated variables. For example a set of economic variables might be highly correlated with one another but have a low level of correlation with TV advertising levels. Why is this good to know? Because multicollinearity only affects those variables between which the relationship of collinearity holds. So if we are only interested in obtaining an accurate estimate for the TV advertising level then we need not be concerned about collinearity among the economic variables. Of course this plot deals just with correlation between two variables rather than with full blown collinearity but it’s a good place to start.

The plot works by performing a multi-dimensional scaling on the absolute value of the correlations. In other words if we think of correlation as measure of distance (highly correlated variables being closer together) it finds the best possible two dimensional representation of those distances. Note if these distances could be represented in a euclidean space then this would be equivalent to a plot of the first two dimensions in a principal components analysis. However correlations cannot be represented in this way.

Beware though, the best representation isn’t always a good representation. I’ve included some output which will tell you what proportion on the squared distances is represented in the plot. If it is low I would recommend trying ordinal scaling (the other method option) instead.

I’ve a feeling that applying multi-dimensional scaling to correlations is equivalent to something else that is probably far simpler but I haven’t had time to give it attention. If anyone knows I’d be very interested.

About these ads

About simonraper

I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. As Data Scientist at Channel 4 my role is to develop machine learning solutions that allow the channel to build a deeper relationship with the viewer and innovate the way advertising is traded and work on supporting the creative side of the business. My current interests are in scalable machine learning (Mahout, Hadoop), interactive visualisatons (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts - a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 270 K visits and was mentioned in Flowing Data, I09, and the online editions of The New York Times and The New Yorker.

Discussion

8 thoughts on “Marketing Mix Lab: Visualising The Correlation Matrix

  1. I wonder how it would compare to a Dendogram of the same data.

    Posted by Robert Stratton | January 23, 2012, 5:26 pm
    • Thanks Robert. What I’d like to do if I get time is overlay the minimum spanning tree on the chart. That will show the structure of the dendrogram (for single linkage clustering) and if I label the links we can see how much the 2 dimensional plot distorts the true distances.

      Posted by simonraper | January 23, 2012, 8:35 pm
      • Nice piece of code!
        Here’s a modified version of your function: I’ve added the links between nodes defined by the correlation. More correlation is indicated by a more pronounced link. Check it out here: http://t.co/ygxrAbVw.

        And the code:
        vis_correl2 <-function(dataset, abr=0, minCorr=0){

        #Create correlation matrix
        cor_ts<-cor(dataset)
        n<-dim(cor_ts)[2]
        # Create dissimilarities
        ones<-matrix(rep(1,n^2), nrow=n)
        dis_ts<-ones-abs(cor_ts)
        # Do MDS
        fit <- cmdscale(dis_ts, k=2)
        x <- fit[,1]
        y <- fit[,2]
        labels0){labels<-substr(labels,1,abr)}

        mds_plot<-data.frame(labels, x, y)

        ### define the edges
        mds_plot <- cbind(id = 1:nrow(mds_plot), mds_plot)
        ### building the appropriate edge representation
        edge_weight <- as.data.frame.table(cor_ts)
        edge_weight <- merge(edge_weight,mds_plot[,1:2], by.x="Var2", by.y="labels" )
        edge_weight <- merge(edge_weight,mds_plot[,1:2], by.x="Var1", by.y="labels" )
        ### build the edges in coordinate form
        edge_weight_coord <- data.frame(mds_plot[edge_weight[,4],], mds_plot[edge_weight[,5],], edge_weight$Freq)
        rownames(edge_weight_coord) <- NULL
        colnames(edge_weight_coord) <- c("id1", "labels1", "X1","Y1","id2", "labels2", "X2","Y2", "Corr")
        ###
        edge_weight_coord edge_weight_coord$id2,]

        ### restric to positive corr
        edge_weight_coord minCorr, ]

        #Plot the results
        ggplot() +
        geom_segment(aes(x=X1, y=Y1, xend = X2, yend = Y2, alpha = Corr), data=edge_weight_coord, colour=”gray”, size=1) +
        geom_text(aes(x=x, y=y, label=labels, main=”MDS Plot of Correlations”), data= mds_plot, size=4, angle=0) + coord_fixed() + opts(title =”MDS Plot of Correlations”) +
        geom_point(aes(x=x, y=y, label=labels, main=”MDS Plot of Correlations”), data= mds_plot,size = 5, colour=”red”, alpha=0.3) +
        scale_x_continuous(breaks = NULL) + scale_y_continuous(breaks = NULL) +
        # discard default grid + titles in ggplot2
        opts(panel.background = theme_blank()) + opts(legend.position=”none”)+
        opts(axis.title.x = theme_blank(), axis.title.y = theme_blank()) +
        opts( legend.background = theme_rect(colour = NA)) +
        opts(panel.background = theme_rect(fill = “white”, colour = NA)) +
        opts(panel.grid.minor = theme_blank(), panel.grid.major = theme_blank())
        }

        Posted by cafe876 (@cafe876) | April 6, 2012, 8:55 am
      • That’s fantastic. Thank you so much!

        Posted by simonraper | April 6, 2012, 8:43 pm
  2. Having used this a few times I’ve now realised quite how non-euclidean some correlation matrices can be. This means that sometimes metric scaling does a poor job of representing the distances. I’ve done two things to allow for this. 1) provided output that shows what proportion of the distances are represented in the plot. 2) provided the option to do ordinal scaling instead as it often does a much better job if the distances are a long way from Euclidean. I’ve updated the crime data plot as it is better represented with the ordinal scaling. One other thing: I’ve used the directlabels package to add the labels as it does a better job and looks nicer.

    Posted by simonraper | April 20, 2012, 3:36 pm
  3. Hello!
    I’m a beginner in R. How can I adapt this script to my own dataset. What line(s) should I replace?

    Posted by Ranaivoson | May 10, 2012, 1:27 pm
    • Hi there,

      all you need to do is load the function into memory by running it. Then you can run the line visCorrel(data, x) where data is your dataframe of explanatory variables and x is the number of places you want the column names truncated to. You’ll need to install the packages ggplot2 and directlabels for it to work. Also see my note above about trying both metric and ordinal scaling.

      Simon

      Posted by simonraper | May 11, 2012, 11:00 am
  4. OK! Thank you!

    Posted by Ranaivoson | May 13, 2012, 6:52 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog Stats

  • 287,905 hits

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 480 other followers

Follow

Get every new post delivered to your Inbox.

Join 480 other followers

%d bloggers like this: